Is it possible to merge rows in R data.frame? - r

If I have the following data.frame:
> df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c f
Is there a clear way to identify rows in which the df$x entries include the string value *, then use this condition to force the string entries of that row to be merged with the row preceding itself, resulting in a data.frame like the following:
> df
x y
1 a b* d e
2 c f
I assume that the first part of the problem (identifying the x row values that include *) can be done in a fairly straightforward way using regular expressions. What I'm having trouble with is forcing a data.frame row to merge with the row preceding it.
One particularly tricky challenge is if multiple entries in a row have the pattern, e.g.
> df <- data.frame(x = c('a', 'b*', 'c*'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c* f
In this case, the resulting data.frame should look like this:
> df
x y
1 a b* c* d e f
The main issue that I find is that after running one iteration of a loop that pastes the strings from df[2,] into df[1,], the data.frame index does not adapt to the new data.frame size:
> df
x y
1 a b* d e
3 c* f
So, subsequent indexing is disrupted.

Here is an initial solution:
# Creating the data frame
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'), stringsAsFactors = FALSE)
df
# Creating a vector of rows with *
ast <- grepl("\\*", df$x)
# For loop
for (i in seq(length(ast), 1, -1)) {
  if (ast[i]) {
    df[i-1, "x"] <- paste(df[i-1, "x"], df[i, "x"], sep = " ")
    df[i-1, "y"] <- paste(df[i-1, "y"], df[i, "y"], sep = " ")
    df <- df[-i, ]
  }
}
This is only an initial solution: you still have to handle edge cases, such as the first row containing a *. I hope it helps as a starting point.
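One way to cover the first-row case is to skip the merge whenever there is no preceding row. The sketch below wraps the same backward loop in a helper. The name merge_starred and the decision to leave a starred first row untouched are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical helper: merge each row whose x contains "*" into the row above it.
merge_starred <- function(df) {
  # Flag rows whose x contains a literal asterisk
  ast <- grepl("*", df$x, fixed = TRUE)
  # Walk from the bottom up so that removing row i never shifts rows still to visit
  for (i in rev(seq_along(ast))) {
    # Assumption: a starred first row has nothing above it, so it is left as-is
    if (ast[i] && i > 1) {
      df[i - 1, "x"] <- paste(df[i - 1, "x"], df[i, "x"])
      df[i - 1, "y"] <- paste(df[i - 1, "y"], df[i, "y"])
      df <- df[-i, ]
    }
  }
  rownames(df) <- NULL  # renumber rows after the deletions
  df
}

df <- data.frame(x = c('a', 'b*', 'c*'), y = c('d', 'e', 'f'),
                 stringsAsFactors = FALSE)
merged <- merge_starred(df)
merged
```

Because the loop runs from the last row upward, a run of consecutive starred rows collapses into the first unstarred row above it.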

Here are 3 alternatives (for the base R one, I assumed x and y are character rather than factor. I also made your data more complicated in order to cover different scenarios)
(A bit more complicated data set)
df <- data.frame(x = c('p','a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n'),
y = c('r','d', 'e', 'f', 'g', 'i', 'k', 'm', 'o'),
stringsAsFactors = FALSE)
Base R
aggregate(. ~ ID,
transform(df, ID = cumsum(!grepl("*", x, fixed = TRUE))),
paste, collapse = " ")
# ID x y
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
data.table
library(data.table)
setDT(df)[, lapply(.SD, paste, collapse = " "),
by = .(ID = cumsum(!grepl("*", df[["x"]], fixed = TRUE)))]
# ID x y
# 1: 1 p r
# 2: 2 a b* c* d e f
# 3: 3 d h* j* l* g i k m
# 4: 4 n o
dplyr
library(dplyr)
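df %>%
  group_by(ID = cumsum(!grepl("*", x, fixed = TRUE))) %>%
  # funs() is deprecated in current dplyr; across() (dplyr >= 1.0.0) does the same job
  summarise(across(everything(), ~ paste(.x, collapse = " ")))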
# # A tibble: 4 x 3
# ID x y
# <int> <chr> <chr>
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
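All three alternatives rely on the same grouping key, so it may help to look at it in isolation. This is a minimal sketch on the x column only; the variable names starts_group, id and merged_x are illustrative.

```r
x <- c('p', 'a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n')

# TRUE for rows without "*": each of these starts a new group
starts_group <- !grepl("*", x, fixed = TRUE)

# The running count of group starts gives every starred row the ID of the
# nearest unstarred row above it
id <- cumsum(starts_group)
id
# 1 2 2 2 3 3 3 3 4

# Paste the members of each group together
merged_x <- as.vector(tapply(x, id, paste, collapse = " "))
merged_x
```

If the very first row were starred, its ID would be 0, so it would form a group of its own rather than cause an error.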

This does not actually merge the rows: for each row that has a *, it pastes in the values of the previous row, and then it drops every row that is followed by a row with a *. Note that it only looks back one row, so runs of consecutive * rows are not chained together.
library(dplyr)
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'), stringsAsFactors = FALSE)
df <- mutate(df,
Operator = grepl("\\*",x), # Check for *
lagged.x = lag(x, n = 1), # Get x value from 1 row ago
lagged.y = lag(y, n = 1), # Get y value from 1 row ago
x = ifelse(Operator, paste(lagged.x, x),x), # if there is * paste lagged x
y = ifelse(Operator, paste(lagged.y, y),y), # if there is * paste lagged y
lead.Operator = lead(Operator, n = 1) # Check if next row has a *
)
# keep only rows that had no * in following row and that had no following row (last row)
df <- filter(df, !lead.Operator | is.na(lead.Operator))
# Select just the x and y columns
df <- select(df, x, y)

Related

How to create pairs from a single column counting the occurrence in R?

So I'm working on creating an edges file for a social network analysis based on IMDb data.
And I've run into a problem and I can't figure out how to fix it as I'm new to R.
Assuming I have the following dataframe:
movieID <- c('A', 'A','A', 'B','B', 'C','C', 'C')
crewID <- c('Z', 'Y', 'X', 'Z','V','V', 'X', 'Y')
rating <- c('7.3','7.3', '7.3', '2.1', '2.1', '9.0','9.0', '9.0')
df <- data.frame(movieID, crewID, rating)
movieID CrewID Rating
A       Z      7.3
A       Y      7.3
A       X      7.3
B       Z      2.1
B       V      2.1
C       V      9.0
C       X      9.0
C       Y      9.0
I am trying to build unique pairs of CrewIDs within a movie with a weight that equals the occurrence of that pair, meaning how often these two crew members have worked on a movie together. So basically I want a dataframe like the following as a result:
CrewID1 CrewID2 weight (not a col but explanation)
Z       Y       1      together once in movie A
Z       X       1      together once in movie A
Y       X       2      together twice in movies A and C
Z       V       1      together once in movie B
V       X       1      together once in movie C
V       Y       1      together once in movie C
The pairs (Z,Y) and (Y,Z) are equal to each other as I don't care about direction.
I found the following StackOverflow thread on a similar issue:
How to create pairs from a single column based on order of occurrence in R?
However in my case this skips the combination (V,Y) and (X,Z) and the count for (X,Y) is still 1 and I can't figure out how to fix it.
m <- crossprod(table(df[-3]))
m[upper.tri(m, diag = TRUE)] <-0
subset(as.data.frame.table(m), Freq > 0)
CrewID CrewID.1 Freq
2 X V 1
3 Y V 1
4 Z V 1
7 Y X 2
8 Z X 1
12 Z Y 1
Maybe not the most efficient solution, but this would be one way of doing it:
library(magrittr)  # provides the %>% pipe
# Define a function that generates pairs of ids
make_pairs <- function(data){
  # Extract all ids in the movie
  data$crew %>%
    # Organize them alphabetically
    sort() %>%
    # Generate all unique pairs
    combn(2) %>%
    # Prep for map
    as.data.frame() %>%
    # Generate pairs as single string
    purrr::map_chr(stringr::str_flatten, '_')
}
# Generate the data
tibble::tibble(
  movie = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
  crew = c('Z', 'Y', 'X', 'Z', 'V', 'V', 'X', 'Y')
) %>%
  # Nest the data so all ids in one movie get put together
  tidyr::nest(data = -movie) %>%
  # Generate pairs of interactions
  dplyr::mutate(
    pairs = purrr::map(data, make_pairs)
  ) %>%
  # Expand all pairs
  tidyr::unnest(cols = pairs) %>%
  # Separate them into unique columns
  tidyr::separate(pairs, c('id1', 'id2')) %>%
  # Count the number of times two ids co-occur
  dplyr::count(id1, id2)
# A tibble: 6 x 3
id1 id2 n
<chr> <chr> <int>
1 V X 1
2 V Y 1
3 V Z 1
4 X Y 2
5 X Z 1
6 Y Z 1
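The same idea (sort the crew within each movie, enumerate pairs, then count) can also be sketched in base R without any packages. The names pairs_by_movie, pairs and edges are illustrative, and the sketch assumes each crew member is listed at most once per movie.

```r
movieID <- c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C')
crewID  <- c('Z', 'Y', 'X', 'Z', 'V', 'V', 'X', 'Y')

# For every movie, list each unordered pair once; sorting first means
# (Z, Y) and (Y, Z) end up as the same pair (Y, Z)
pairs_by_movie <- lapply(split(crewID, movieID), function(ids) {
  ids <- sort(unique(ids))
  if (length(ids) < 2) return(NULL)  # a single crew member forms no pair
  t(combn(ids, 2))                   # one row per pair
})
pairs <- do.call(rbind, pairs_by_movie)

# Count how often each pair occurs across movies
edges <- as.data.frame(table(CrewID1 = pairs[, 1], CrewID2 = pairs[, 2]),
                       stringsAsFactors = FALSE)
edges <- edges[edges$Freq > 0, ]
names(edges)[3] <- "weight"
edges
```

The result has one row per pair that co-occurs at least once, with weight giving the number of shared movies.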

copy rows based on the number of a variables (R) [duplicate]

So, I have a dataset with 2 columns X and Y. Y is an integer between 0 and 5. I need to change the level of the detail of the dataset.
I want to copy each row the number of times Y indicates. As an example:
X | Y
______
a | 1
b | 0
c | 2
Becomes
X |
___
a |
c |
c |
a remains once, b disappears and c appears now twice. I do not need to keep the Y number, except in the number of rows of X.
My first thought was to do
df4 <- df %>% filter(Y == 4)
df4 <- rbind(df4, df4, df4, df4) %>% select(-Y)
but that all seems ugly, and it does not generalize to, say, Y = 20.
Thank you!
We could use uncount
library(dplyr)
library(tidyr)
df %>%
uncount(Y) %>%
as_tibble
-output
# A tibble: 3 x 1
# X
# <chr>
#1 a
#2 c
#3 c
or in base R with rep
df[rep(seq_len(nrow(df)), df$Y),'X', drop = FALSE]
data
df <- data.frame(X = c('a', 'b', 'c'), Y = c(1, 0, 2))
Maybe this:
df <- data.frame( 'x' = c('a', 'b', 'c'), 'y'= c(1, 0, 2))
rep(df$x, df$y)
or
## For a dataframe:
df[match(rep(df$x, df$y), df$x),'x', drop=FALSE]
Output:
R>rep(df$x, df$y)
[1] "a" "c" "c"
What about this?
data.frame(
X = with(
df,
rep(X, Y)
)
)
which gives
X
1 a
2 c
3 c
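The base R answers above all depend on how rep behaves when times is a vector: element i is repeated times[i] times, and a count of 0 drops it. A minimal sketch that makes the intermediate row index visible (row_idx and expanded are illustrative names):

```r
df <- data.frame(X = c('a', 'b', 'c'), Y = c(1, 0, 2), stringsAsFactors = FALSE)

# Repeat each row number as many times as its Y value says
row_idx <- rep(seq_len(nrow(df)), times = df$Y)
row_idx
# 1 3 3

# Index the data frame with the repeated row numbers, keeping only X
expanded <- df[row_idx, "X", drop = FALSE]
rownames(expanded) <- NULL  # tidy up row names such as "3.1"
expanded
```

Indexing by row number rather than by value keeps this correct even if X contains duplicate values.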

How to get the sum of the product of selected column in a data frame?

This is probably very simple but I couldn't think of a solution.
I have the following data frame, and I want to multiply column y with column z and sum the answer.
> df <- data.frame(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
> df
x y z
1 1 2 2
2 2 4 3
3 3 6 4
The value found should be equal to 40.
with would be an option here if we don't want to repeat df$ or df[[ to extract the column
with(df, sum( y * z))
#[1] 40
Or %*%
c(df$y %*% df$z)
Additionally, you could use data.table. The second position inside the brackets (after the first comma) is j, which selects or computes on columns. You don't need the spaces, they're just there to show how it works.
library(data.table)
a <- data.table(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
#dt i j by
a[ , sum(y*z), ]
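For reference, the approaches above all compute the same dot product, 2*2 + 4*3 + 6*4 = 40. A small base R sketch comparing them (s1, s2 and s3 are illustrative names):

```r
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6), z = c(2, 3, 4))

# Element-wise product, then sum
s1 <- sum(df$y * df$z)

# Matrix product of two vectors gives a 1x1 matrix; drop() turns it into a number
s2 <- drop(df$y %*% df$z)

# crossprod(a, b) computes t(a) %*% b
s3 <- drop(crossprod(df$y, df$z))

c(s1, s2, s3)
# 40 40 40
```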

How to obtain minimum difference between 2 columns

I want to obtain the minimum distance between 2 columns, however the same name may appear in both Column A and Column B. See example below;
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, I just get the minimum distance for each column, i.e. C will have 2 results as it is in both columns rather than showing the closest patient considering both columns. Also, I only get a list of distances, so I can't see which patient is linked to which;
Patient1 SNP
1: A 8
This code below works.
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1); df$Patient2 <- as.character(df$Patient2);
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# group pairs by min distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), min(Distance))
# Resort, min to top.
nearest <- df[order(df$`min(Distance)`), ]
# Keep only the first of each group
nearest <- nearest[!duplicated(nearest$Patient1),]
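Applied to the data from the question, the same mirror-then-pick-the-minimum idea can be sketched in base R as follows. The names d, both and nearest are illustrative, and the sketch assumes ties can be broken arbitrarily.

```r
d <- data.frame(
  Patient1 = c('A', 'A', 'A', 'A', 'B', 'C'),
  Patient2 = c('B', 'C', 'D', 'E', 'F', 'G'),
  Distance = c(8, 11, 19, 23, 6, 25),
  stringsAsFactors = FALSE
)

# Add the mirrored pairs so every patient appears in the Patient1 column
both <- rbind(d, data.frame(Patient1 = d$Patient2,
                            Patient2 = d$Patient1,
                            Distance = d$Distance,
                            stringsAsFactors = FALSE))

# Sort by patient, then by distance, and keep the first (closest) row per patient
both <- both[order(both$Patient1, both$Distance), ]
nearest <- both[!duplicated(both$Patient1), ]
rownames(nearest) <- NULL
nearest
```

For patients A, B and C this gives B at 8, F at 6 and A at 11, matching the output requested in the question; patients that only appear in the second column (D to G) get a row as well.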

Data frame: How to compare current row to some other rows without looping?

I have the following df and use case: I'd like to find, and set a flag on, all rows for which another row exists that satisfies a condition, e.g.
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'))
> df
X Y
1 a a
2 b c
3 c d
I'd like to find those rows whose Y value is the same as the X value in another row. In the example above, row #2 satisfies the condition because its Y = c and row #3 has X = c. Note that row #1 does not satisfy the condition.
Something like:
df$Flag <- find(df, Y == X_in_another_row(df))
1
For each Y, we check if any value in X (other than in the same row) matches.
sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i])
#[1] FALSE TRUE FALSE
If indices are necessary, wrap the whole thing in which
which(sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i]))
#[1] 2
2 (not tested well)
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = FALSE)
temp = outer(df$X, df$Y, "==") #Check equality among values of X and Y
diag(temp) = FALSE #Set diagonal values as FALSE (for same row)
colSums(temp) > 0
#[1] FALSE TRUE FALSE
which(match(df$Y,df$X)!=1:nrow(df))
I think this should work.
df <- data.frame(X= c(1,2,3,4,5,3,2,1), Y = c(1,2,3,4,5,6,7,8))
which(with(df, (X %in% Y) & (X != Y)))
Works on the original data.frame if we set stringsAsFactors = FALSE
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = F)
which(with(df, (X %in% Y) & (X != Y)))
Quite convoluted but I'll put it here anyway. This should work even if there are repeated values in X.
For example with the following dataframe df2:
df2 = data.frame(X=c('a','b','c','a','d'), Y=c('a','c','d','e','b'))
X Y
1 a a
2 b c
3 c d
4 a e
5 d b
## Specifying the same factor levels allows us to get a square matrix
df2$X = factor(df2$X,levels=union(df2$X,df2$Y))
df2$Y = factor(df2$Y,levels=union(df2$X,df2$Y))
m = as.matrix(table(df2))
valY = rowSums(m)*colSums(m)-diag(m)
which(df2$Y %in% names(valY)[as.logical(valY)])
[1] 1 2 3 5
Essentially you want to know whether Y is in X but you want the condition to be FALSE when X == Y:
df$Z <- with(df, (Y != X) & (Y %in% X))
# Assume you want to use position 4, value 'c', to find all the rows that Y is 'c'
df <- data.frame(X = c('a', 'b', 'd', 'c'),
Y = c('a', 'c', 'c', 'd'))
row <- 4 # assume the desire row is position 4
val <- as.character( df[(row),'X'] ) # get the character and turn it into character type
df[df$Y == val,]
# Result
# X Y
# 2 b c
# 3 d c
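A loop-free way to express "Y matches X in some other row" that also copes with repeated X values is to count the matches and subtract the row's own match. This is a sketch under that reading of the question; flag_other_row is a hypothetical helper name.

```r
# Hypothetical helper: TRUE where Y equals the X value of a different row
flag_other_row <- function(df) {
  x <- as.character(df$X)
  y <- as.character(df$Y)
  # How many rows have an X equal to this row's Y?
  n_match <- vapply(y, function(v) sum(x == v), integer(1), USE.NAMES = FALSE)
  # Discount the row's own X when it equals its own Y
  (n_match - (x == y)) > 0
}

df <- data.frame(X = c('a', 'b', 'c'), Y = c('a', 'c', 'd'),
                 stringsAsFactors = FALSE)
flag_other_row(df)
# FALSE TRUE FALSE

df2 <- data.frame(X = c('a', 'b', 'c', 'a', 'd'), Y = c('a', 'c', 'd', 'e', 'b'),
                  stringsAsFactors = FALSE)
which(flag_other_row(df2))
# 1 2 3 5
```

Row 1 of df2 is flagged because its Y value a also appears as X in row 4, which is the repeated-value case that a plain (Y != X) & (Y %in% X) test misses.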
