Combining values of Boolean columns into one with priority in R - r

I have gone through the links below, but they only partially solved my problem.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into one column, but it has to apply a priority when there are two "Y" values in a row.
The priority is A>B>C>D>E, which means that if there is a "Y" in A then the resulting value should be A. Similarly, in the example df above (Id 6) both C and D have "Y", so the resulting df should contain "C".
Hence output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.

# Columns listed from highest to lowest priority
col_order = c("A", "B", "C", "D", "E")
tmp = data.frame(ID = dat[,1],
Result = col_order[apply(
X = dat[col_order],
MARGIN = 1,
FUN = function(x) which(x == "Y")[1])],
stringsAsFactors = FALSE)
tmp$Result[is.na(tmp$Result)] = "Not Present"
tmp
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
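The priority in the answer above comes from taking the first "Y" in col_order. As a rough sketch of the same idea without apply, max.col with ties.method = "first" can pick the first matching column; the names is_y and first_y are only illustrative, and rows without any "Y" have to be masked separately because max.col would otherwise report column 1 for them.

```r
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
                  A = c('Y','N','N','N','N','N','N','N'),
                  B = c('N','Y','N','N','N','N','Y','N'),
                  C = c('N','N','Y','N','N','Y','N','N'),
                  D = c('N','N','N','Y','N','Y','N','N'),
                  E = c('N','N','N','N','Y','N','Y','N'))

# Columns from highest to lowest priority
col_order <- c("A", "B", "C", "D", "E")

# Logical matrix: TRUE where a cell is "Y"
is_y <- dat[col_order] == "Y"

# Column index of the first TRUE in each row (ties resolved left to right)
first_y <- max.col(is_y * 1, ties.method = "first")

# Rows with no "Y" at all get NA instead of the misleading first column
result <- data.frame(Id = dat$Id,
                     Result = ifelse(rowSums(is_y) > 0, col_order[first_y], NA),
                     stringsAsFactors = FALSE)
result
```

Changing the priority only requires reordering col_order.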

Related

Taking a subset of a main dataset based on the values of another data frame that is a subset of the main data frame

I have these two datasets : df as the main data frame and g as a created data frame
df = data.frame(x = seq(1,20,2),y = letters[1:10] )
df
g = data.frame(xx = c(2,3,4,5,7,8,9) )
and I want to take a subset of the data frame df based on the values xx of the data frame g as follows
m = df[df$x==g$xx,]
but the result depends on the positions at which the values of the two data frames happen to line up, not on the matched values themselves.
output
> m
x y
2 3 b
I don't know what error I am making.
Maybe you need to use %in% instead of ==
> df[df$x %in% g$xx,]
x y
2 3 b
3 5 c
4 7 d
5 9 e
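To see why == gave a single row, here is a small sketch of what R does with the two vectors. == compares element by element and recycles the shorter vector (with a warning when the lengths do not divide evenly), while %in% tests membership by value. The variable names recycled, positional and by_value are just for illustration.

```r
df <- data.frame(x = seq(1, 20, 2), y = letters[1:10])
g <- data.frame(xx = c(2, 3, 4, 5, 7, 8, 9))

# What == effectively compares df$x against: g$xx recycled to length 10
recycled <- rep_len(g$xx, nrow(df))

# Position-by-position comparison: only position 2 lines up (3 == 3)
positional <- df$x == recycled
which(positional)

# Value-based membership: every x that occurs anywhere in g$xx
by_value <- df$x %in% g$xx
df[by_value, ]
```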
You can also use inner_join from dplyr:
library(dplyr)
df %>%
inner_join(g, by = c("x" = "xx"))
intersect can be useful too (combined with %in%, since it returns the common values rather than row positions)
df[df$x %in% intersect(df$x, g$xx),]
using merge
merge(df, g, by.x = "x", by.y = 'xx')
x y
1 3 b
2 5 c
3 7 d
4 9 e

Trying to produce a loop for summing up consecutive column values in R

I am trying to write a loop to sum up consecutive columns of values in a table and output them into another table.
For example, in my original table, we have columns a, b, c, etc, which contain the same number of numeric values.
The resulting table then should be a, a+b, a+b+c, etc up to the last column of the original table
I have a feeling a for loop should be sufficient for this operation; however, I can't get my head around the format and syntax.
Any help would be appreciated!
Since you're new, here is an example of a very minimal reproducible example:
library(data.table)
x = data.table(a=1:3,b=4:6,c=7:9)
for(... now what?
And here's a way to do your task:
library(data.table)
# make some dummy data
X = data.table(a=1:2,b=3:4,c=5:6)
# make an empty result table
Y = data.table()
# for i = 1 to the number of columns in X
for(i in 1:ncol(X)){
# colnames(X) is "a" "b" "c".
# colnames(X)[1:1] is "a", colnames(X)[1:2] is "a" "b", colnames(X)[1:3] is "a" "b" "c"
# paste0(colnames(X)[1:1],collapse='') is "a",
# paste0(colnames(X)[1:2],collapse='') is "ab",
# paste0(colnames(X)[1:3],collapse='') is "abc"
newcolname = paste0(colnames(X)[1:i],collapse='')
# Y[,(newcolname):= is data.table syntax to create a new column called newcolname
# X[,1:i] selects columns 1 to i
# rowSums calculates the, um, row sums :D
Y[,(newcolname):=rowSums(X[,1:i])]
}
Maybe you need Reduce like below
cbind(
df,
setNames(
as.data.frame(Reduce(`+`, df, accumulate = TRUE)),
Reduce(paste0, names(df), accumulate = TRUE)
)
)
such that
a b c a ab abc
1 1 4 7 1 5 12
2 2 5 8 2 7 15
3 3 6 9 3 9 18
Data
df <- structure(list(a = 1:3, b = 4:6, c = 7:9), class = "data.frame", row.names = c(NA,
-3L))
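Since the question asked specifically for a for loop, here is a minimal base R sketch of the same running sum, using the df from the Data section. The names running and res are only illustrative; the loop keeps one running total and stores a copy after each column is added.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

running <- 0      # running row-wise total, starts at zero
res <- list()     # collects one cumulative column per iteration

for (i in seq_along(df)) {
  # add the i-th column to the running total
  running <- running + df[[i]]
  # name the new column after all columns summed so far: "a", "ab", "abc"
  res[[paste0(names(df)[1:i], collapse = "")]] <- running
}

res <- as.data.frame(res)
res
```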

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to assign new factor levels (labels) to df1 that don't exist there. R doesn't know what to do with them, so it just inserts NA values.
As r2evans answered, he used stringsAsFactors to disable converting strings to factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled as default in forthcoming R4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Calling df1$x %in% df2$x returns a logical vector (boolean) of which elements in df1$x are found in df2$x - i.e. the first two and not the second two. But it doesn't say which positions in the first vector correspond to which in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
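A small sketch may make the difference concrete. It uses the same frames with the rows of df2 swapped, as in the experiment above; pos and hit are illustrative names. match returns, for each element of df1$x, its position in df2$x, which is exactly the correspondence that %in% throws away.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# df2 with its rows swapped
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

# %in% only says whether a value is present somewhere in df2$x
df1$x %in% df2$x

# match says where: position in df2$x for each df1$x, NA when absent
pos <- match(df1$x, df2$x)
pos

# Replace only the matched rows, taking each value from its own position
hit <- !is.na(pos)
df1$y[hit] <- df2$y[pos[hit]]
df1
```

With this, the order of the rows in df2 no longer matters.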
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table or dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (the default since R 4.0.0); otherwise all the levels need to be common to both datasets.

Reshape origin destination data

I need to turn this data frame :
df1 <- data.frame(A = c(1,2,3), B = c(2,1,4), Flow = c(50,30,20))
into a data frame like this :
df2 <- data.frame(A = c(1,3), B = c(2,4), AtoB = c(50,20), BtoA = c(30, NA))
I am trying to reshape it with dplyr. Is there an existing function or a way to do that?
An option would be to create an Identifier column between 'A' and 'B' with labels 'AtoB/BtoA' based on the minimum value in each row, then change the values in 'A', 'B' by taking the min/max for each row (pmin/pmax) and spread the output back to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
mutate(grpIdent = case_when(A == pmin(A, B) ~ 'AtoB', TRUE ~ 'BtoA'),
A1= pmin(A, B), B1 = pmax(A, B)) %>%
select(A = A1, B = B1, grpIdent, Flow) %>%
spread(grpIdent, Flow)
# A B AtoB BtoA
#1 1 2 50 30
#2 3 4 20 NA
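For comparison, the same pmin/pmax idea can be sketched in base R without spread. This is only an illustration under the same assumption as above (at most one flow per direction for each pair); lo, hi, fwd, a_to_b and b_to_a are made-up names.

```r
df1 <- data.frame(A = c(1, 2, 3), B = c(2, 1, 4), Flow = c(50, 30, 20))

# Order each pair so the smaller endpoint comes first
lo <- pmin(df1$A, df1$B)
hi <- pmax(df1$A, df1$B)

# TRUE when the row already runs from the smaller to the larger endpoint
fwd <- df1$A == lo

# One frame per direction, keyed by the ordered pair
a_to_b <- data.frame(A = lo[fwd],  B = hi[fwd],  AtoB = df1$Flow[fwd])
b_to_a <- data.frame(A = lo[!fwd], B = hi[!fwd], BtoA = df1$Flow[!fwd])

# Full join on the pair; a missing direction becomes NA
out <- merge(a_to_b, b_to_a, by = c("A", "B"), all = TRUE)
out
```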
Using base R (this might require introducing a blank row or rows, since an unpaired last row otherwise gets recycled values). It is also assumed that the to and fro values are entered in succession.
new_df<-cbind(df1[seq(1,nrow(df1), by=2),], df1[seq(2,nrow(df1), by=2),])[,-c(4,5)]
names(new_df)<-c("A","B","AtoB","BtoA")
new_df
Result:
# A B AtoB BtoA
#1 1 2 50 30
#3 3 4 20 30

split dataframe in R by row

I have a long dataframe like this:
Row Conc group
1 2.5 A
2 3.0 A
3 4.6 B
4 5.0 B
5 3.2 C
6 4.2 C
7 5.3 D
8 3.4 D
...
The actual data have hundreds of rows. I would like to split it into A to C, and D. I searched the web and found several solutions, but none is applicable to my case.
How to split a data frame?
For example:
Case 1:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
I don't want to split by an arbitrary number.
Case 2: Split by level/factor
data2 <- data[data$sum_points == 2500, ]
I don't want to split by a single factor either. Sometimes I want to combine many levels together.
Case 3: select by row number
newdf <- mydf[1:3,]
The actual data have hundreds of rows. I don't know the row number. I just know the level I would like to split at.
It sounds like you want two data frames, where one has (A,B,C) in it and one has just D. In that case you could do
Data1 <- subset(Data, group %in% c("A","B","C"))
Data2 <- subset(Data, group=="D")
Correct me if you were asking something different
For those who end up here through internet search engines time after time, the answer to the question in the title is:
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
split(x, sort(as.numeric(rownames(x))))
Assuming that your data frame has numerically ordered row names. Also split(x, rownames(x)) works, but the result is rearranged.
You may consider using the recode() function from the "car" package.
# Load the library and make up some sample data
library(car)
set.seed(1)
dat <- data.frame(Row = 1:100,
Conc = runif(100, 0, 10),
group = sample(LETTERS[1:10], 100, replace = TRUE))
Currently, dat$group contains the upper case letters A to J. Imagine we wanted the following four groups:
"one" = A, B, C
"two" = D, E, J
"three" = F, I
"four" = G, H
Now, use recode() (note the semicolon and the nested quotes).
recodes <- recode(dat$group,
'c("A", "B", "C") = "one";
c("D", "E", "J") = "two";
c("F", "I") = "three";
c("G", "H") = "four"')
split(dat, recodes)
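If you would rather not depend on car, the same regrouping can be sketched in base R with a named lookup vector. This is just one possible way to do it; the data frame conc, the vector lookup and the labels "A_to_C" and "D_only" are made up for the example, based on the sample rows in the question.

```r
conc <- data.frame(Row = 1:8,
                   Conc = c(2.5, 3.0, 4.6, 5.0, 3.2, 4.2, 5.3, 3.4),
                   group = rep(c("A", "B", "C", "D"), each = 2))

# Map every original level to the chunk it should end up in
lookup <- c(A = "A_to_C", B = "A_to_C", C = "A_to_C", D = "D_only")

# Look up the chunk label for each row, then split on it
parts <- split(conc, lookup[as.character(conc$group)])

names(parts)
parts$D_only
```

Adding or merging levels then only means editing lookup.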
With base R, we can input the factor that we want to split on.
split(df, df$group == "D")
Output
$`FALSE`
Row Conc group
1 1 2.5 A
2 2 3.0 A
3 3 4.6 B
4 4 5.0 B
5 5 3.2 C
6 6 4.2 C
$`TRUE`
Row Conc group
7 7 5.3 D
8 8 3.4 D
If you wanted to split on multiple letters, then we could:
split(df, df$group %in% c("A", "D"))
Another option is to use group_split from dplyr, but you will need to make a grouping variable first for the split.
library(dplyr)
df %>%
mutate(spl = ifelse(group == "D", 1, 0)) %>%
group_split(spl, .keep = FALSE)

Resources