Let's say that I have two data frames, for example:
a <- c(10,20,30,40)
b <- c('b', 'p', 't', 'x')
c <- c(TRUE,FALSE,TRUE,FALSE)
d <- c(2.5, 8, 10, 7)
df1 <- data.frame(a,b,c,d)
e<-c(2.5,2.5,8,8,8,10,10,10)
f<-c(T, T, F, F, F, T, F, T)
df2<- data.frame(e,f)
I know that all the values of column e in dataframe 2 are contained in column d of dataframe 1.
I want to be able to place column b into dataframe 2 so that it would look like this:
e<-c(2.5,2.5,8,8,8,10,10,10)
f<-c(T, T, F, F, F, T, F, T)
b<-c("b", "b", "p", "p", "p", "t", "t", "t")
df2<- data.frame(e,f,b)
That is, where a value in column e of dataframe 2 is equal to a value in column d of dataframe 1, I want to place the value of column b corresponding to that value in column d into a new column in dataframe 2.
In reality, I am using much larger datasets than this, so I am hoping for something that does this in a timely manner (i.e., preferably not nested for loops). Any help would be greatly appreciated!
We can do a merge in base R
merge(df2, df1[c('b', 'd')], by.x = 'e', by.y = 'd')
Another base R solution using match
df2$b <- df1$b[match(df2$e,df1$d)]
which gives
> df2
e f b
1 2.5 TRUE b
2 2.5 TRUE b
3 8.0 FALSE p
4 8.0 FALSE p
5 8.0 FALSE p
6 10.0 TRUE t
7 10.0 FALSE t
8 10.0 TRUE t
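As a hedged aside on how the match approach behaves: match() returns, for each value in df2$e, the position of its first occurrence in df1$d, and NA when the value is absent. The sketch below rebuilds the question's data and adds a check for unmatched keys. That check and the explicit stringsAsFactors = FALSE are additions for illustration, not part of the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = c(10, 20, 30, 40),
                  b = c("b", "p", "t", "x"),
                  c = c(TRUE, FALSE, TRUE, FALSE),
                  d = c(2.5, 8, 10, 7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(e = c(2.5, 2.5, 8, 8, 8, 10, 10, 10),
                  f = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE))

# For each df2$e, the position of the first equal value in df1$d (NA if absent)
idx <- match(df2$e, df1$d)

# Assumption: every key is expected to be found, so stop early if one is missing
stopifnot(!anyNA(idx))

# Look up b by position; the row order of df2 is preserved
df2$b <- df1$b[idx]
```

Unlike merge, which sorts the result by the key column by default, this leaves df2 in its original row order. If df1$d contained duplicate values, only the first match would be used.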
Related
I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If you look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. Factors are not text: in short, they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that carry a text label. That is likely your issue.
Running your code gives a warning because you are trying to insert new factor levels (labels) into df1 that don't exist there. R doesn't know what to do with them, so it just inserts NA values.
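The factor behaviour described above can be reproduced in isolation. This is a minimal sketch on a standalone vector rather than the asker's data frame; the names y, y_bad and y_chr are made up for illustration.

```r
# A factor only accepts values that are already among its levels
y <- factor(c("a", "b", "c", "d"))

# Assigning an unseen label yields NA and an "invalid factor level" warning
# (the warning is silenced here only to keep the output clean)
y_bad <- y
suppressWarnings(y_bad[1] <- "f")
is.na(y_bad[1])   # TRUE

# Converting to character first avoids the problem
y_chr <- as.character(y)
y_chr[1] <- "f"
y_chr             # "f" "b" "c" "d"
```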
In his answer, r2evans used stringsAsFactors = FALSE to stop strings from being converted to factors. You can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical (boolean) vector saying which elements of df1$x are found in df2$x, i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
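To make the difference concrete, here is a small sketch on plain vectors that stand in for df1$x and for df2$x after its rows were swapped. The names x1 and x2 are placeholders for illustration.

```r
x1 <- c(1, 2, 3, 4)   # stands in for df1$x
x2 <- c(2, 1)         # stands in for df2$x after swapping its rows

# %in% only answers "is this value present?"; the result is the same
# whatever the order of x2
x1 %in% x2            # TRUE TRUE FALSE FALSE
which(x1 %in% x2)     # 1 2

# match() answers "where in x2 is this value?", which is what is needed
# to pick the right replacement row
match(x1, x2)         # 2 1 NA NA
```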
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (this is the default from R 4.0.0 onwards), or else both datasets need to have all factor levels in common.
How to use a column index in dplyr::left_join (and its family)?
Example (by column names):
library(dplyr)
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"))
data2 = data.frame(alpha = c("d", "f"), beta = c(20, 30))
left_join(data1, data2, by = c("var2" = "alpha"))
However, replacing by = c("var2" = "alpha") with by = c(data1[,2] = data2[,1]) results in this error:
by must be a (named) character vector, list, or NULL for natural
joins (not recommended in production code), not logical.
I need to use the column position so that I can loop over the join inside new functions.
How can I do it?
Using dplyr:
# rename_at changes alpha into var2 in data2
left_join(data1, rename_at(data2, 1, ~ names(data1)[2]), by = names(data1)[2])
# output
var1 var2 beta
1 a d 20
2 b d 20
3 c f 30
Using base R:
merge(data1, data2, by.x = 2, by.y = 1, all.x = TRUE, all.y = FALSE)
# output
var2 var1 beta
1 d a 20
2 d b 20
3 f c 30
I don't know how you're going to use the column index but a hacky solution is the following:
#make a named vector for the by argument, see ?left_join
join_var <- names(data2)[1] #change index here based on data2
names(join_var) <- names(data1)[2] #change index here based on data1
left_join(data1, data2, by = join_var)
Depending on the final output you desire by using the column index, there is probably a more appropriate solution than this.
I have gone through the links below, but they only partially solved my problem.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into a single result column, but it has to apply a priority when there are two "Y" values in a row.
The priority is A>B>C>D>E, which means that if there is a "Y" in A then the resulting value should be A. Similarly, in the example df above, Id 6 has "Y" in both C and D, so the result should be "C".
Hence output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.
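# col_order was not defined in the snippet; the priority from the question is assumed
col_order = c("A", "B", "C", "D", "E")
tmp = data.frame(ID = dat[,1],
                 # for each row, the first column (in priority order) that holds "Y"
                 Result = col_order[apply(
                     X = dat[col_order],
                     MARGIN = 1,
                     FUN = function(x) which(x == "Y")[1])],
                 stringsAsFactors = FALSE)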
tmp$Result[is.na(tmp$Result)] = "Not Present"
tmp
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
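For larger data, a row-wise apply can be slow. As an alternative sketch (a different technique from the answer above, not the answerer's method), max.col() can pick the first "Y" per row in one vectorised call. The names is_y, first_y and result are made up here, and rows without any "Y" are set to NA rather than "Not Present".

```r
dat <- data.frame(Id = c(1, 2, 3, 4, 5, 6, 7, 8),
                  A = c("Y", "N", "N", "N", "N", "N", "N", "N"),
                  B = c("N", "Y", "N", "N", "N", "N", "Y", "N"),
                  C = c("N", "N", "Y", "N", "N", "Y", "N", "N"),
                  D = c("N", "N", "N", "Y", "N", "Y", "N", "N"),
                  E = c("N", "N", "N", "N", "Y", "N", "Y", "N"),
                  stringsAsFactors = FALSE)

# Columns listed from highest to lowest priority
col_order <- c("A", "B", "C", "D", "E")

# Logical matrix: TRUE where a cell is "Y"
is_y <- as.matrix(dat[col_order]) == "Y"

# Index of the first TRUE per row; ties are resolved towards the leftmost column
first_y <- max.col(is_y * 1, ties.method = "first")

# Rows with no "Y" at all would otherwise report column 1, so mark them NA
first_y[rowSums(is_y) == 0] <- NA

result <- data.frame(Id = dat$Id,
                     Result = col_order[first_y],
                     stringsAsFactors = FALSE)
```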
I have two data frames. I would like to take a subset of the first data frame, considering only the columns for which the first value is equal to the first value of the rows of the second data frame.
Example
Data Frame 1:
columns_df1 : a b c d e
Data Frame 2:
rows_df2 : a c e
Subset I would like to obtain:
final_columns_df1 = a c e
I am stuck on how to compare columns with rows belonging to two different data frames.
Thanks for your help!
Ok. It's a little unclear what you want from your question, as you don't provide a full reproducible example. But I think this is what you're looking for.
df1 <- data.frame(a = c(1, 2),
b = c(3, 4),
c = c(5, 6),
d = c(7, 8),
e = c(9, 10))
df2 <- data.frame(f = c("a", "b"),
g = c("c", "d"),
h = c("e", "f"))
final_columns_df1 <- df1[ , names(df1) %in% df2[1, ]]
final_columns_df1
a c e
1 1 5 9
2 2 6 10
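A slightly more defensive variant of the same idea, as a sketch: df2[1, ] is itself a one-row data frame, so flattening it with unlist() makes the comparison explicit. It assumes the columns of df2 are character rather than factor, which is forced here with stringsAsFactors = FALSE. The names wanted and keep are introduced only for illustration.

```r
df1 <- data.frame(a = c(1, 2),
                  b = c(3, 4),
                  c = c(5, 6),
                  d = c(7, 8),
                  e = c(9, 10))
df2 <- data.frame(f = c("a", "b"),
                  g = c("c", "d"),
                  h = c("e", "f"),
                  stringsAsFactors = FALSE)

# Flatten the first row of df2 into a plain character vector
wanted <- unlist(df2[1, ], use.names = FALSE)

# Keep only names that actually exist in df1, in df1's column order
keep <- intersect(names(df1), wanted)

# drop = FALSE keeps a data frame even if only one column matches
final_columns_df1 <- df1[, keep, drop = FALSE]
```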
I have a long dataframe like this:
Row Conc group
1 2.5 A
2 3.0 A
3 4.6 B
4 5.0 B
5 3.2 C
6 4.2 C
7 5.3 D
8 3.4 D
...
The actual data have hundreds of rows. I would like to split the data frame into two parts: groups A to C, and group D. I looked on the web and found several solutions, but none is applicable to my case.
How to split a data frame?
For example:
Case 1:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
I don't want to split by an arbitrary number.
Case 2: Split by level/factor
data2 <- data[data$sum_points == 2500, ]
I don't want to split by a single factor either. Sometimes I want to combine many levels together.
Case 3: select by row number
newdf <- mydf[1:3,]
The actual data have hundreds of rows. I don't know the row number. I just know the level I would like to split at.
It sounds like you want two data frames, where one has (A,B,C) in it and one has just D. In that case you could do
Data1 <- subset(Data, group %in% c("A","B","C"))
Data2 <- subset(Data, group=="D")
Correct me if you were asking something different.
For those who end up here through internet search engines time after time, the answer to the question in the title is:
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
split(x, sort(as.numeric(rownames(x))))
This assumes that your data frame has numerically ordered row names. split(x, rownames(x)) also works, but the result is rearranged.
You may consider using the recode() function from the "car" package.
# Load the library and make up some sample data
library(car)
set.seed(1)
dat <- data.frame(Row = 1:100,
Conc = runif(100, 0, 10),
group = sample(LETTERS[1:10], 100, replace = TRUE))
Currently, dat$group contains the upper case letters A to J. Imagine we wanted the following four groups:
"one" = A, B, C
"two" = D, E, J
"three" = F, I
"four" = G, H
Now, use recode() (note the semicolon and the nested quotes).
recodes <- recode(dat$group,
'c("A", "B", "C") = "one";
c("D", "E", "J") = "two";
c("F", "I") = "three";
c("G", "H") = "four"')
split(dat, recodes)
With base R, we can input the factor that we want to split on.
split(df, df$group == "D")
Output
$`FALSE`
Row Conc group
1 1 2.5 A
2 2 3.0 A
3 3 4.6 B
4 4 5.0 B
5 5 3.2 C
6 6 4.2 C
$`TRUE`
Row Conc group
7 7 5.3 D
8 8 3.4 D
If you wanted to split on multiple letters, then we could:
split(df, df$group %in% c("A", "D"))
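If names more descriptive than TRUE and FALSE are wanted, or more than two pieces, one possible sketch is to map each level to a label and split on that. The data below is rebuilt from the question's sample rows; the lookup vector piece_of and its labels are hypothetical names chosen for illustration.

```r
df <- data.frame(Row = 1:8,
                 Conc = c(2.5, 3.0, 4.6, 5.0, 3.2, 4.2, 5.3, 3.4),
                 group = rep(c("A", "B", "C", "D"), each = 2),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: original level -> name of the piece it belongs to
piece_of <- c(A = "A_to_C", B = "A_to_C", C = "A_to_C", D = "D_only")

# Index the lookup by each row's group, then split on the resulting labels
parts <- split(df, piece_of[df$group])

names(parts)          # "A_to_C" "D_only"
nrow(parts$A_to_C)    # 6
```

Adding another piece only requires another label in the lookup vector.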
Another option is to use group_split from dplyr, but we will need to make a grouping variable first for the split.
library(dplyr)
df %>%
mutate(spl = ifelse(group == "D", 1, 0)) %>%
group_split(spl, .keep = FALSE)