Repeat elements of data.frame [duplicate] - r

This seems to be a fairly simple problem but I can't find a simple solution:
I want to repeat a data.frame (i) several times as follows:
My initial data.frame:
i <- data.frame(c("A","A","A","B","B","B","C","C","C"))
i
Printing i results in:
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
How I want to repeat the elements (the numbers in the first column are only there for readability):
i
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
I tried doing it using:
i[rep(seq_len(nrow(i)), each=2),]
but it gives me output like this (the numbers in the first column are only there for readability):
1 A
2 A
3 A
1 A
2 A
3 A
4 B
5 B
6 B
4 B
5 B
6 B
7 C
8 C
9 C
7 C
8 C
9 C
Please help!

Not sure if this solves your problem, but to obtain the desired output you could simply repeat the entire sequence:
i <- c("A","A","A","B","B","B","C","C","C")
i2 <- rep(i,2)
#> i2
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "A" "A" "A" "B" "B" "B" "C" "C" "C"
Since you're dealing with a data frame, you could use a slightly modified variant (note that this returns a vector, not a data frame):
i <- data.frame(c("A","A","A","B","B","B","C","C","C"))
i2 <- rep(i[,1],2)

You could use rbind(i, i). Does that work?

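If you are working with a data frame, this code will work fine too (it repeats the whole data frame 5 times; use 2 for the output shown in the question):
i[rep(seq_len(nrow(i)), 5), , drop = FALSE]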
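The difference between the attempt in the question and the answers above comes down to the `each` and `times` arguments of rep(). A minimal sketch, assuming the single column is named `x` (the original data frame has no explicit column name):

```r
# Single-column data frame as in the question; the column name `x` is an assumption
i <- data.frame(x = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                stringsAsFactors = FALSE)

# times = 2 repeats the whole index sequence: 1..9 followed by 1..9
stacked <- i[rep(seq_len(nrow(i)), times = 2), , drop = FALSE]

# each = 2 repeats every index in place: 1, 1, 2, 2, ...
interleaved <- i[rep(seq_len(nrow(i)), each = 2), , drop = FALSE]

stacked$x
interleaved$x
```

`drop = FALSE` keeps the result a data frame even though there is only one column.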

Related

Extract element from vector using for loops and dataframe

I checked the answers in
How to extract specific element from vector using for loop in R
but they are not what I want.
I have data that contains 17 rows and a set of variables.
Edit 1: My aim is to:
1- take the names of variables from vectors;
2- calculate the sum of each variable in a vector using the data frame;
3- keep only the variable that has the highest sum in each vector.
So I have data that contains all the variables, and my aim is to obtain new_data that contains just the variables that have the highest sum in each vector.
I have a vector that is generated on every iteration of a for loop, and it contains names of variables (different names depending on conditions inside the for loop).
My aim is to eliminate the names of variables in every vector except the one that has the highest sum.
For example, I have this data frame:
my_data >
NAMES A B C D E F
One 1 2 3 4 5 6
Two 2 3 4 5 6 7
THREE 3 4 5 6 7 8
FOUR 4 5 6 7 8 9
FIVE 5 6 7 8 9 10
SIX 6 7 8 9 10 11
Let's say that the first vector generated by the for loop contains the names:
vec >
"B" "C" "D"
Using these variables, the program will eliminate "B" and "C" because D is the one that has the highest sum.
So I will obtain:
New_data
NAMES A D E F
One 1 4 5 6
Two 2 5 6 7
THREE 3 6 7 8
FOUR 4 7 8 9
FIVE 5 8 9 10
SIX 6 9 10 11
Let's say the second vector contains the names "A" and "E",
so the program will eliminate A because E is the variable that has the highest sum.
So
New data >
NAMES D E F
One 4 5 6
Two 5 6 7
THREE 6 7 8
FOUR 7 8 9
FIVE 8 9 10
SIX 9 10 11
Let's say that the third vector contains "E" and "F".
Here's the part of the program I used to analyze the vectors:
# This is how I generated the vector
vec <- names(Filter(function(x) x > 0, rowSums(tmp) > 0 |
# Inside the for loop, using the vector generated above
my_data %>%
dplyr::select(all_of(vec)) %>% # select the columns named in vec
slice(-17) %>% # remove row 17
map_dbl(sum) %>% # sum each column
which.max() %>% # position of the largest sum
names() -> selected # name of the column with the largest sum
# `selected` now holds the name of the variable I should keep
my_data %>% dplyr::select(!all_of(vec), all_of(selected)) -> new_data # drop the vec columns, keep the selected one
}
The problem with this program is that, in the end, new_data contains all the variables except those eliminated in the last comparison. Because it always starts from my_data, the last iteration compares the variables in my last vector and keeps every variable of my_data in new_data, except the variables in the last vector that do not have the highest sum.
Continuing the example I started before,
let's say the third vector contains "E" and "F".
The result I need to obtain is:
New data >
NAMES D F
One 4 6
Two 5 7
THREE 6 8
FOUR 7 9
FIVE 8 10
SIX 9 11
# I eliminated E because F has the highest sum
But the program I wrote gives me this result:
NAMES A B C D F
One 1 2 3 4 6
Two 2 3 4 5 7
THREE 3 4 5 6 8
FOUR 4 5 6 7 9
FIVE 5 6 7 8 10
SIX 6 7 8 9 11
I think this is because the program takes its information from my original data and keeps all the variables that are not in my vector (that's why the last comparison keeps A, B, C and D).
So now I don't know how to fix this problem.
Please tell me if you need more information.
You may try this option -
for(i in vec) {
#Get the column names to delete based on column sum
drop_columns <- i[-which.max(colSums(my_data[i]))]
my_data[drop_columns] <- NULL
}
# NAMES D F
#1 One 4 6
#2 Two 5 7
#3 THREE 6 8
#4 FOUR 7 9
#5 FIVE 8 10
#6 SIX 9 11
data
my_data <- structure(list(NAMES = c("One", "Two", "THREE", "FOUR", "FIVE",
"SIX"), A = 1:6, B = 2:7, C = 3:8, D = 4:9, E = 5:10, F = 6:11),
class = "data.frame", row.names = c(NA, -6L))
vec <- list(c('B', 'C', 'D'), c('A', 'E'), c('E', 'F'))
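The loop above can also be wrapped in a function so that the original data frame stays untouched. This is only a sketch: `keep_max_per_group` is a hypothetical helper name, and the `intersect()` guard (skipping columns that an earlier group already removed) is an added assumption, not part of the answer above.

```r
my_data <- data.frame(NAMES = c("One", "Two", "THREE", "FOUR", "FIVE", "SIX"),
                      A = 1:6, B = 2:7, C = 3:8, D = 4:9, E = 5:10, F = 6:11)
vec <- list(c("B", "C", "D"), c("A", "E"), c("E", "F"))

# Hypothetical helper: for each group of column names, keep only the
# column with the largest sum and drop the others from the running result
keep_max_per_group <- function(data, groups) {
  for (g in groups) {
    g <- intersect(g, names(data))   # ignore columns already dropped
    if (length(g) < 2) next          # nothing to compare
    sums <- colSums(data[g])
    data[setdiff(g, names(which.max(sums)))] <- NULL
  }
  data
}

new_data <- keep_max_per_group(my_data, vec)
names(new_data)
```

Because each iteration works on the result of the previous one, eliminations accumulate instead of being recomputed from `my_data` every time.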
I don't know what you are doing, so here is an alternative.
tmp <- replicate(5, sample(LETTERS[1:10], 3), simplify = FALSE)
[[1]]
[1] "J" "C" "A"
[[2]]
[1] "F" "D" "B"
[[3]]
[1] "C" "G" "H"
[[4]]
[1] "J" "F" "C"
[[5]]
[1] "H" "G" "J"
I made up these vectors of column names, because I don't know how you generate them. Then we iterate over this object and remove the columns (here df is your data frame).
for (i in tmp) {
# your stuff here
df <- df[, !colnames(df) %in% i]
}
NAMES E
1 One 5
2 Two 6
3 THREE 7
4 FOUR 8
5 FIVE 9
6 SIX 10

R create group variable based on row order and condition

I have a data frame containing multiple groups that are not explicitly stated. Instead, a new group always starts when type == 1 and continues through the following rows where type == 2. The number of rows per group can vary.
How can I explicitly create a new variable based on the order of another column? The groups, of course, should be mutually exclusive.
My data:
df <- data.frame(type = c(1,2,2,1,2,1,2,2,2,1),
stand = 1:10)
Expected output with new group myGroup:
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d
One option could be:
with(df, letters[cumsum(type == 1)])
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "d"
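To see why this works, it may help to look at the intermediate running count: `type == 1` is TRUE exactly where a group starts, and cumsum() turns those starts into a group index. A small sketch using the data from the question (`group_id` is just an illustrative variable name):

```r
df <- data.frame(type = c(1, 2, 2, 1, 2, 1, 2, 2, 2, 1),
                 stand = 1:10)

# TRUE marks the first row of each group; the running total numbers the groups
group_id <- cumsum(df$type == 1)

# Map the group index to a letter and store it as a new column
df$myGroup <- letters[group_id]
df
```

Note that `letters` only has 26 entries, so with more than 26 groups it would probably be safer to keep the numeric `group_id` instead.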
Here is another option using rep() + diff(), but not as simple as the approach by @tmfmnk
idx <- which(df$type == 1)
v <- diff(idx)
df$myGroup <- rep(letters[seq_along(idx)], c(v, nrow(df) - sum(v)))
such that
> df
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d

How to remove duplicate consecutive text in R separated by : [duplicate]

The data set looks like this:
id agent final_col
1 1 A:A A
2 1 A:A A
3 2 B B
4 3 C C
5 4 A:C:C A:C
6 4 A:C:C A:C
7 4 A:C:C A:C
How can I remove duplicate entries, to have a clean column like the final_col in R?
Let's just generate a new column based on df$agent
df$final_col <- sapply(df$agent, function(txt){
paste(unique(unlist(strsplit(txt, ":"))), collapse=":")
})
For each element we split by :, select the unique elements, and paste them back together.
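Note that unique() removes every repeated value, not only consecutive ones, so "A:B:A" would become "A:B". If only consecutive repeats should be collapsed, one possible sketch uses rle(). The helper name `dedupe_consecutive` and the extra test value "A:B:A" are illustrative assumptions, not part of the original data:

```r
agent <- c("A:A", "B", "A:C:C", "A:B:A")

# Hypothetical helper: collapse runs of identical neighbouring values only
dedupe_consecutive <- function(txt) {
  parts <- strsplit(txt, ":", fixed = TRUE)[[1]]
  paste(rle(parts)$values, collapse = ":")
}

consecutive_only <- vapply(agent, dedupe_consecutive, character(1), USE.NAMES = FALSE)
consecutive_only
```

For the data in the question both approaches give the same result, since every duplicate there is consecutive.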
You can do this with gsub and a regular expression
gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A" "A" "B" "C" "A:C" "A:C" "A:C"
Your Data
DAT = read.table(text=" id agent final_col
1 1 A:A A
2 1 A:A A
3 2 B B
4 3 C C
5 4 A:C:C A:C
6 4 A:C:C A:C
7 4 A:C:C A:C",
header=TRUE, stringsAsFactors=FALSE)

Reducing lists in R to match another list.

Suppose I have a dataframe 'H', like so
C1 C2
a 1
b 1
c 2
d 3
e 4
f 4
g 5
and a vector X (a factor) with the values
"1" "2" "4"
Using match(),
X2=H[match(X,H$C2),]
only reduces H to three rows, with only one instance of each element of X present (a, c, e). What command should I use to reduce H so that all rows whose C2 value is found in X are present (i.e., the reduced table should contain a, b, c, e, f)?
Cheers.
> H[H$C2 %in% X,]
C1 C2
1 a 1
2 b 1
3 c 2
5 e 4
6 f 4
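The reason the two commands differ is that match() returns only the position of the first hit for each element of X, while %in% returns one TRUE/FALSE per row of H. A short sketch, assuming H and X are built as follows (the construction itself is not shown in the question):

```r
H <- data.frame(C1 = letters[1:7],
                C2 = c(1, 1, 2, 3, 4, 4, 5),
                stringsAsFactors = FALSE)
X <- as.factor(c("1", "2", "4"))

# match(): one index per element of X, pointing at the first matching row
first_only <- H[match(X, H$C2), ]

# %in%: one logical per row of H, so every matching row is kept
all_rows <- H[H$C2 %in% X, ]

first_only$C1
all_rows$C1
```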

Filtering a data frame in R by row names from a second data frame

I have the data.frame :
df1<-data.frame("Sp1"=1:6,"Sp2"=7:12,"Sp3"=13:18)
rownames(df1)=c("A","B","C","D","E","F")
df1
Sp1 Sp2 Sp3
A 1 7 13
B 2 8 14
C 3 9 15
D 4 10 16
E 5 11 17
F 6 12 18
I filter df1 by a cutoff value for rowSums(df1) and return sites (row names) that I want to include in downstream analysis.
include<-rownames(df1[rowSums(df1)>=22,])
include
[1] "B" "C" "D" "E" "F"
I have a second data.frame :
df2<-data.frame(site.x=c("A","B","C"), site.y=c("D","E","F"),score=1:3)
site.x site.y score
1 A D 1
2 B E 2
3 C F 3
I want to filter df2 so that it only includes rows where both df2$site.x and df2$site.y appear in the sites listed in 'include', i.e. filtering out the row containing "A" and returning:
site.x site.y score
2 B E 2
3 C F 3
I have tried :
filter<-df2$site.x == include & df2$site.y == include
filtered<-df2[filter,]
Thanks for any advice!
ANSWER
Use %in%:
filter<-df2$site.x %in% include & df2$site.y %in% include
filtered<-df2[filter,]
filtered
site.x site.y score
2 B E 2
3 C F 3
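The attempt with == fails because == compares the two vectors position by position (recycling the shorter one), whereas %in% tests membership. A minimal sketch with the data from the question; `positional` and `membership` are illustrative names, and suppressWarnings() is used only to hide the recycling warning that == raises here:

```r
df2 <- data.frame(site.x = c("A", "B", "C"),
                  site.y = c("D", "E", "F"),
                  score = 1:3,
                  stringsAsFactors = FALSE)
include <- c("B", "C", "D", "E", "F")

# Position-by-position comparison: "A" vs "B", "B" vs "C", "C" vs "D", ...
positional <- suppressWarnings(df2$site.x == include)

# Membership test: is each site.x value found anywhere in include?
membership <- df2$site.x %in% include

positional
membership
```

The positional comparison has the length of the longer vector and never lines up with the rows of df2, which is why it cannot be used as a row filter here.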
For me, it works with:
filter<-df2$site.x %in% include & df2$site.y %in% include
df2[filter,]
In fact, you've put df1 instead of df2 in the last two lines of your question.
