Copy rows based on the value of a variable (R) [duplicate] - r

I have a dataset with two columns, X and Y. Y is an integer between 0 and 5. I need to change the level of detail of the dataset.
I want to copy each row the number of times Y indicates. As an example:
X | Y
______
a | 1
b | 0
c | 2
Becomes
X |
___
a |
c |
c |
a remains once, b disappears, and c now appears twice. I do not need to keep Y itself; it only determines how many times each X row appears.
My first thought was to do
df4 <- df %>% filter(Y == 4)
df4 <- rbind(df4, df4, df4, df4) %>% select (-Y)
but that seems ugly, and it does not generalize to, say, Y = 20.
Thank you!

We could use uncount
library(dplyr)
library(tidyr)
df %>%
uncount(Y) %>%
as_tibble
-output
# A tibble: 3 x 1
# X
# <chr>
#1 a
#2 c
#3 c
or in base R with rep
df[rep(seq_len(nrow(df)), df$Y),'X', drop = FALSE]
data
df <- data.frame(X = c('a', 'b', 'c'), Y = c(1, 0, 2))
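To see why the base R version works, it can help to look at the row index that rep builds before it is used for subsetting. This is only a sketch on the same toy data; the names idx and expanded are introduced here for illustration and are not part of the answer above.

```r
# Same toy data as above
df <- data.frame(X = c("a", "b", "c"), Y = c(1, 0, 2), stringsAsFactors = FALSE)

# Repeat each row number as many times as Y says:
# row 1 once, row 2 zero times, row 3 twice
idx <- rep(seq_len(nrow(df)), times = df$Y)
idx
# [1] 1 3 3

# Subset with the repeated index, keep only X, and reset the row names
expanded <- df[idx, "X", drop = FALSE]
rownames(expanded) <- NULL
expanded
```

Because a count of 0 contributes no index at all, rows with Y = 0 drop out without any extra filtering.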

Maybe this:
df <- data.frame( 'x' = c('a', 'b', 'c'), 'y'= c(1, 0, 2))
rep(df$x, df$y)
or
## For a dataframe:
df[match(rep(df$x, df$y), df$x),'x', drop=FALSE]
Output:
R>rep(df$x, df$y)
[1] "a" "c" "c"

What about this?
data.frame(
X = with(
df,
rep(X, Y)
)
)
which gives
X
1 a
2 c
3 c

Related

R find index of a variable and subset a list

I have a list that looks like this
#Make dataframes
library(dplyr) # for %>% and mutate
df1 = data.frame(x = c("a", "b", "c"), y = 1:3, stringsAsFactors = FALSE)
df2 = df1 %>% mutate(y = y*2)
df3 = df1 %>% mutate(y = y*3)
#Make a name for each dataframe
myvar = "fname"
#Combine name and dataframe into a list
mylist = list(myvar, df1)
#Add the other dataframes and name to the list (done in a loop for my bigger dataset)
list2 = list(myvar, df2)
mylist = rbind(mylist, list2)
list3 = list(myvar, df3)
mylist = rbind(mylist, list3)
I want to pull a subset of the list with all the data associated with "c"
x y
3 c 3
x y
3 c 6
x y
3 c 9
This is what I tried but it doesn't work
#Find all instances of "c"
picksite = "c"
site_indices = which(mylist[,2] == picksite)
mylist[site_indices,]
Any suggestions on how to do this, or even a link to better understand lists? Thanks so much.
Wrapping the which inside of lapply will solve this problem:
lapply(mylist[,2], FUN = function(i) i[which(i$x == "c"),])
$mylist
x y
3 c 3
$list2
x y
3 c 6
$list3
x y
3 c 9
Using tidyverse, we can loop over the list with map and use if_any to filter
library(dplyr)
library(purrr)
map(mylist[,2], ~ .x %>%
filter(if_any(everything(), ~ .x == "c")))
-output
$mylist
x y
1 c 3
$list2
x y
1 c 6
$list3
x y
1 c 9
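It may help to look at what rbind actually produces here. The sketch below assumes the same construction as in the question (base R only, using transform instead of mutate) and shows that mylist is a matrix whose cells are lists, so each cell of column 2 holds a whole data frame rather than a value that can be compared with "c". The named list frames at the end is a hypothetical alternative layout, not something from the question.

```r
df1 <- data.frame(x = c("a", "b", "c"), y = 1:3, stringsAsFactors = FALSE)
df2 <- transform(df1, y = y * 2)
df3 <- transform(df1, y = y * 3)

# rbind on lists gives a 3 x 2 matrix whose cells are list elements
mylist <- rbind(list("fname", df1), list("fname", df2), list("fname", df3))
dim(mylist)

cell  <- mylist[1, 2]    # single brackets: still a list wrapping the data frame
inner <- mylist[[1, 2]]  # double brackets: the data frame itself

# Hypothetical alternative: keep the data frames in a plain named list
frames <- list(df1 = df1, df2 = df2, df3 = df3)
picked <- lapply(frames, function(d) d[d$x == "c", , drop = FALSE])
picked
```

With a plain named list, the filter is applied inside each data frame, which is what the lapply and map answers above do with mylist[,2].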

How to create pairs from a single column counting the occurrence in R?

So I'm working on creating an edges file for a social network analysis based on IMDb data.
And I've run into a problem and I can't figure out how to fix it as I'm new to R.
Assuming I have the following dataframe:
movieID <- c('A', 'A','A', 'B','B', 'C','C', 'C')
crewID <- c('Z', 'Y', 'X', 'Z','V','V', 'X', 'Y')
rating <- c('7.3','7.3', '7.3', '2.1', '2.1', '9.0','9.0', '9.0')
df <- data.frame(movieID, crewID, rating)
movieID | CrewID | Rating
_________________________
A | Z | 7.3
A | Y | 7.3
A | X | 7.3
B | Z | 2.1
B | V | 2.1
C | V | 9.0
C | X | 9.0
C | Y | 9.0
I am trying to build unique pairs of CrewIDs within a movie with a weight that equals the occurrence of that pair, meaning how often these two crew members have worked on a movie together. So basically I want a dataframe like the following as a result:
CrewID1 | CrewID2 | weight | (not a col but explanation)
__________________________________________________________
Z | Y | 1 | together once in movie A
Z | X | 1 | together once in movie A
Y | X | 2 | together twice in movies A and C
Z | V | 1 | together once in movie B
V | X | 1 | together once in movie C
V | Y | 1 | together once in movie C
The pairs (Z,Y) and (Y,Z) are equal to each other as I don't care about direction.
I found the following StackOverflow thread on a similar issue:
How to create pairs from a single column based on order of occurrence in R?
However in my case this skips the combination (V,Y) and (X,Z) and the count for (X,Y) is still 1 and I can't figure out how to fix it.
m <- crossprod(table(df[-3]))
m[upper.tri(m, diag = TRUE)] <-0
subset(as.data.frame.table(m), Freq > 0)
CrewID CrewID.1 Freq
2 X V 1
3 Y V 1
4 Z V 1
7 Y X 2
8 Z X 1
12 Z Y 1
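A short sketch of why crossprod gives the pair counts, using the data from the question: table builds a movie-by-crew incidence matrix of 0/1 entries, and crossprod multiplies its transpose by itself, so each off-diagonal cell counts the movies two crew members share. The names inc and co are introduced here for illustration.

```r
movieID <- c("A", "A", "A", "B", "B", "C", "C", "C")
crewID  <- c("Z", "Y", "X", "Z", "V", "V", "X", "Y")

# Rows are movies, columns are crew; 1 means the person worked on that movie
inc <- table(movieID, crewID)

# t(inc) %*% inc: crew-by-crew matrix of shared movies
co <- crossprod(inc)
co

co["X", "Y"]  # X and Y share movies A and C
# [1] 2
diag(co)      # the diagonal is the number of movies per crew member
```

Since the matrix is symmetric, zeroing the upper triangle and the diagonal, as the answer above does, keeps each undirected pair exactly once.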
Maybe not the most efficient solution but this would be one way of doing it:
library(magrittr) # provides %>%
# Define a function that generates pairs of ids
make_pairs <- function(data){
  # Extract all ids in the movie
  data$crew %>%
    # Organize them alphabetically
    sort() %>%
    # Generate all unique pairs
    combn(2) %>%
    # Prep for map
    as.data.frame() %>%
    # Generate pairs as single string
    purrr::map_chr(stringr::str_flatten, '_')
}
# Generate the data
tibble::tibble(
movie = c('A', 'A', 'A', 'B','B', "C", 'C', 'C'),
crew = c('Z', 'Y', 'X', 'Z', 'V', 'V', 'X', 'Y')
) %>%
# Nest the data so all ids in one movie gets put together
tidyr::nest(data = -movie) %>%
# Generate pairs of interactions
dplyr::mutate(
pairs = purrr::map(data, make_pairs)
) %>%
# Expand all pairs
tidyr::unnest(cols = pairs) %>%
# Separate them into unique columns
tidyr::separate(pairs, c('id1', 'id2')) %>%
# Count the number of times two ids co-occur
dplyr::count(id1, id2)
# A tibble: 6 x 3
id1 id2 n
<chr> <chr> <int>
1 V X 1
2 V Y 1
3 V Z 1
4 X Y 2
5 X Z 1
6 Y Z 1
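The same sort-then-combn idea can also be sketched in base R, without nesting. This is an illustration rather than a drop-in replacement: it assumes every movie has at least two crew members (combn fails otherwise), and the names pairs and edges are made up for this example.

```r
movieID <- c("A", "A", "A", "B", "B", "C", "C", "C")
crewID  <- c("Z", "Y", "X", "Z", "V", "V", "X", "Y")

# For each movie, sort the crew ids and list every unordered pair as one row
pairs <- do.call(rbind, lapply(split(crewID, movieID), function(ids) {
  t(combn(sort(ids), 2))
}))

# Count how often each pair occurs across movies and drop pairs never seen
edges <- as.data.frame(table(id1 = pairs[, 1], id2 = pairs[, 2]),
                       stringsAsFactors = FALSE)
edges <- edges[edges$Freq > 0, ]
edges
```

Sorting inside each movie is what makes (Z, Y) and (Y, Z) collapse into the same pair before counting.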

Select data frame values row-wise using a variable of column names

Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")
My favourite old matrix-indexing will take care of this. Just pass a 2-column matrix with the respective row/column index:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
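To make the matrix-indexing trick more concrete, the sketch below prints the two-column index itself: each row of the index is one (row, column) position to extract. The name idx is introduced here for illustration. Converting with as.matrix first is an extra step taken in this sketch, and it assumes all columns share a type, because as.matrix coerces mixed columns to a common one.

```r
dframe  <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname <- c("x", "y", "x")

# One (row, column) pair per row of the data frame
idx <- cbind(row = seq_len(nrow(dframe)),
             col = match(colname, names(dframe)))
idx

# Indexing with a two-column matrix picks one cell per index row
vals <- as.matrix(dframe)[idx]
vals
# [1] 1 5 3
```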
You can use mapply to pass the row number (of dframe) and the column name for each row, returning the value at that position.
A solution using mapply:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x,y)dframe[x,y],1:nrow(dframe), colname)
#[1] 1 5 3
The next option may not be very intuitive, but if someone wants a solution in a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
left_join(dframe %>% rownames_to_column() %>%
gather(colname, value, -rowname),
by = c("rowname", "colname" )) %>%
select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3

Data frame: How to compare current row to some other rows without looping?

I have the following df and use case: I'd like to find and flag all rows for which another row exists that satisfies a condition, e.g.
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'))
> df
X Y
1 a a
2 b c
3 c d
I'd like to find those rows whose Y value is the same as the X value in another row. In the example above, row #2 qualifies because Y = c and row #3 has X = c. Note that row #1 does not satisfy the condition.
Something like:
df$Flag <- find(df, Y == X_in_another_row(df))
1
For each Y, we check if any value in X (other than in the same row) matches.
sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i])
#[1] FALSE TRUE FALSE
If indices are necessary, wrap the whole thing in which
which(sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i]))
#[1] 2
2 (not tested well)
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = FALSE)
temp = outer(df$X, df$Y, "==") #Check equality among values of X and Y
diag(temp) = FALSE #Set diagonal values as FALSE (for same row)
colSums(temp) > 0
#[1] FALSE TRUE FALSE
which(match(df$Y,df$X)!=1:nrow(df))
I think this should work.
df <- data.frame(X= c(1,2,3,4,5,3,2,1), Y = c(1,2,3,4,5,6,7,8))
which(with(df, (X %in% Y) & (X != Y)))
Works on the original data.frame, if we set stringsAsFactors = FALSE
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = FALSE)
which(with(df, (X %in% Y) & (X != Y)))
Quite convoluted but I'll put it here anyway. This should work even if there are repeated values in X.
For example with the following dataframe df2:
df2 = data.frame(X=c('a','b','c','a','d'), Y=c('a','c','d','e','b'))
X Y
1 a a
2 b c
3 c d
4 a e
5 d b
## Specifying the same factor levels allows us to get a square matrix
df2$X = factor(df2$X,levels=union(df2$X,df2$Y))
df2$Y = factor(df2$Y,levels=union(df2$X,df2$Y))
m = as.matrix(table(df2))
valY = rowSums(m)*colSums(m)-diag(m)
which(df2$Y %in% names(valY)[as.logical(valY)])
[1] 1 2 3 5
Essentially you want to know whether Y is in X but you want the condition to be FALSE when X == Y:
df$Z <- with(df, (Y != X) & (Y %in% X))
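One thing worth checking with this shortcut is how it behaves when X contains repeated values. The sketch below uses the df2 example from the earlier answer and compares the shortcut with an explicit per-row check that excludes only the row's own X. The names per_row and shortcut are introduced here for illustration.

```r
df2 <- data.frame(X = c("a", "b", "c", "a", "d"),
                  Y = c("a", "c", "d", "e", "b"),
                  stringsAsFactors = FALSE)

# Per-row check: is Y[i] found among the X values of the other rows?
per_row <- vapply(seq_len(nrow(df2)),
                  function(i) df2$Y[i] %in% df2$X[-i],
                  logical(1))

# Shortcut: Y appears somewhere in X and differs from X in the same row
shortcut <- with(df2, (Y != X) & (Y %in% X))

which(per_row)
# [1] 1 2 3 5
which(shortcut)
# [1] 2 3 5
```

The two disagree on row 1: its Y ("a") equals its own X, so the shortcut rejects it, even though row 4 also has X = "a". On data where X has no repeated values, both give the same answer.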
# Assume you want to use the X value at position 4, 'c', to find all rows where Y is 'c'
df <- data.frame(X = c('a', 'b', 'd', 'c'),
Y = c('a', 'c', 'c', 'd'))
row <- 4 # assume the desired row is position 4
val <- as.character( df[(row),'X'] ) # get the value and convert it to character type
df[df$Y == val,]
# Result
# X Y
# 2 b c
# 3 d c

Is it possible to merge rows in R data.frame?

If I have the following data.frame:
> df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c f
Is there a clear way to identify rows in which the df$x entries include the string value *, then use this condition to force the string entries of that row to be merged with the row preceding itself, resulting in a data.frame like the following:
> df
x y
1 a b* d e
2 c f
I assume that the first part of the problem (identifying the x row values that include *) can be done in a fairly straightforward way using regular expressions. I'm having trouble identifying how to force a data.frame row to merge with the row preceding it.
One particularly tricky challenge is if multiple entries in a row have the pattern, e.g.
> df <- data.frame(x = c('a', 'b*', 'c*'), y = c('d', 'e', 'f'))
> df
x y
1 a d
2 b* e
3 c* f
In this case, the resulting data.frame should look like this:
> df
x y
1 a b* c* d e f
The main issue that I find is that after running one iteration of a loop that pastes the strings from df[2,] into df[1,], the data.frame index does not adapt to the new data.frame size:
> df
x y
1 a b* d e
3 c* f
So, subsequent indexing is disrupted.
Here is an initial solution:
# Creating the data frame
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'),stringsAsFactors = FALSE)
df
# Creating a vector of rows with *
ast <- grepl("\\*",df$x)
# For loop
for(i in seq(length(ast),1,-1)){
if(ast[i]){
df[i-1,"x"] <- paste(df[i-1,"x"],df[i,"x"],sep=" ")
df[i-1,"y"] <- paste(df[i-1,"y"],df[i,"y"],sep=" ")
df <- df[-i,]
}
}
That's only an initial solution, because you still have to handle the case where the first row has a * and other situations like it. I hope it helps already.
Here are 3 alternatives (for the base R one, I assumed x and y are characters rather than factors; I also made your data more complicated in order to cover different scenarios).
(A bit more complicated data set)
df <- data.frame(x = c('p','a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n'),
y = c('r','d', 'e', 'f', 'g', 'i', 'k', 'm', 'o'),
stringsAsFactors = FALSE)
Base R
aggregate(. ~ ID,
transform(df, ID = cumsum(!grepl("*", x, fixed = TRUE))),
paste, collapse = " ")
# ID x y
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
data.table
library(data.table)
setDT(df)[, lapply(.SD, paste, collapse = " "),
by = .(ID = cumsum(!grepl("*", df[["x"]], fixed = TRUE)))]
# ID x y
# 1: 1 p r
# 2: 2 a b* c* d e f
# 3: 3 d h* j* l* g i k m
# 4: 4 n o
dplyr
library(dplyr)
df %>%
group_by(ID = cumsum(!grepl("*", x, fixed = TRUE))) %>%
summarise(across(everything(), ~ paste(.x, collapse = " ")))
# # A tibble: 4 x 3
# ID x y
# <int> <chr> <chr>
# 1 1 p r
# 2 2 a b* c* d e f
# 3 3 d h* j* l* g i k m
# 4 4 n o
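All three alternatives rely on the same grouping trick, which may be easier to follow on the x column alone. In this sketch, every value without a * starts a new group, and cumsum turns those starts into a running group ID that the starred rows inherit. The names starts_group, id, and merged are introduced here for illustration.

```r
x <- c("p", "a", "b*", "c*", "d", "h*", "j*", "l*", "n")

# TRUE where a value has no "*", i.e. where a new group begins
starts_group <- !grepl("*", x, fixed = TRUE)

# Running count of group starts; starred rows keep the previous ID
id <- cumsum(starts_group)
id
# [1] 1 2 2 2 3 3 3 3 4

# Paste the members of each group together
merged <- tapply(x, id, paste, collapse = " ")
merged
```

If the very first value carried a *, it would get ID 0 and form a group of its own, so that case may need separate handling depending on what result is wanted.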
Not actually merging the rows, but for those rows that have a * it pastes the value of the previous row in, and then it gets rid of rows that had a * in the following row.
library(dplyr)
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'), stringsAsFactors = FALSE)
df <- mutate(df,
Operator = grepl("\\*",x), # Check for *
lagged.x = lag(x, n = 1), # Get x value from 1 row ago
lagged.y = lag(y, n = 1), # Get y value from 1 row ago
x = ifelse(Operator, paste(lagged.x, x),x), # if there is * paste lagged x
y = ifelse(Operator, paste(lagged.y, y),y), # if there is * paste lagged y
lead.Operator = lead(Operator, n = 1) # Check if next row has a *
)
# keep only rows that had no * in following row and that had no following row (last row)
df <- filter(df, !lead.Operator | is.na(lead.Operator))
# Select just the x and y columns
df <- select(df, x, y)
