I have a data frame that gets updated frequently, and some rows need to be removed from it when they contain certain strings. So far I have done this with -grep to drop the rows containing the string in question, e.g.:
dataframe[-grep('some string', dataframe$column),]
However, sometimes the string doesn't appear in the data frame at all, in which case -grep returns an empty data frame. Here's a minimal reproducible example:
> test.df<-data.frame(number=c(1:10), letter=letters[1:10])
> test.df
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
> test.df[-grep('h', test.df$letter),]
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
9 9 i
10 10 j
> test.df[-grep('k', test.df$letter),]
[1] number letter
<0 rows> (or 0-length row.names)
I could wrap the 'test.df[-grep...' call in an 'if' test that checks whether the search string is present before removing it, e.g.:
if(any(grepl('k',test.df$letter))){test.df<-test.df[-grep('k', test.df$letter),]}
...but it seems to me that this should be implicit in the -grep command. Is there a better (more efficient) way to accomplish row removal that doesn't threaten to remove all my data if the search string is absent from the data frame?
Using grepl you could do:
test.df <- data.frame(number = c(1:10), letter = letters[1:10])
test.df[!grepl("h", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 9 9 i
#> 10 10 j
test.df[!grepl("k", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 8 8 h
#> 9 9 i
#> 10 10 j
Created on 2023-01-19 with reprex v2.0.2
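For context, here is a minimal sketch of why the -grep version drops every row when nothing matches: grep() returns integer(0), negating that is still integer(0), and indexing with an empty vector selects zero rows. grepl() returns one TRUE/FALSE per row instead, so negating it keeps everything when there is no match. The names idx, dropped_all and kept are only illustrative.

```r
test.df <- data.frame(number = 1:10, letter = letters[1:10])

# grep() returns the positions of the matches; with no match that is integer(0)
idx <- grep("k", test.df$letter)

# -integer(0) is still integer(0), so this selects zero rows
dropped_all <- test.df[-idx, ]

# grepl() returns a logical vector with one element per row,
# so !grepl() is all TRUE when there is no match and every row is kept
kept <- test.df[!grepl("k", test.df$letter), ]

nrow(dropped_all)  # 0
nrow(kept)         # 10
```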
Instead of negating the index with - when subsetting, you can use the invert argument of grep.
test.df[grep('k', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#8 8 h
#9 9 i
#10 10 j
test.df[grep('h', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#9 9 i
#10 10 j
In this case it looks like the whole string should be matched, so an alternative would be to use == or !=.
test.df[test.df$letter != "k",]
test.df[test.df$letter != "h",]
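If several exact values need to be dropped at once, %in% may be a convenient variant of the != approach. This is only a sketch, assuming whole-string matching is what you want; the vector name drop is made up for illustration. Unlike !=, %in% never returns NA, so rows with a missing letter would be kept as they are rather than turned into NA rows.

```r
test.df <- data.frame(number = 1:10, letter = letters[1:10])

# Hypothetical set of exact values to remove; "k" is absent, which is harmless
drop <- c("h", "k")

# %in% gives one TRUE/FALSE per row, so negating it is safe with no matches
result <- test.df[!(test.df$letter %in% drop), ]

nrow(result)  # 9
```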
I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4],6),
"date" = c(sapply(1:3, function(x) rep(x, 8))),
"num_var1"= 1:24,
"num_var2"= 24:1,
"alpha_var1" = LETTERS[1:24],
"alpha_var2" = LETTERS[25:2] )
and I would like to summarize (mean) its rows by ag and date using dplyr. The issue is that some columns contain characters: for those, I would like to get the first entry by group (the example dataset is already sorted).
Since my dataset has several entries, I would like the code to be able to recognize whether a variable is numeric (including integers) or a character. However, the best solution that I have so far is the following one:
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), ~
if(is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
group_by(ag, date) %>%
summarise(across(where(is.numeric), mean),
across(where(is.character), first), .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
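For comparison, the same rule (mean for numeric columns, first entry otherwise) can be sketched in base R without dplyr. The helper summarise_col and the intermediate names groups, rows and result are assumptions made for this example, and the data frame is rebuilt with stringsAsFactors = FALSE so the letter columns stay character in older R versions.

```r
data <- data.frame(ag = rep(LETTERS[1:4], 6),
                   date = rep(1:3, each = 8),
                   num_var1 = 1:24,
                   num_var2 = 24:1,
                   alpha_var1 = LETTERS[1:24],
                   alpha_var2 = LETTERS[25:2],
                   stringsAsFactors = FALSE)

# Mean for numeric columns, first entry for everything else
summarise_col <- function(x) if (is.numeric(x)) mean(x) else x[1]

# One data frame per (ag, date) combination
groups <- split(data, list(data$ag, data$date), drop = TRUE)

# Collapse each group to a single row, then stack the rows
rows <- lapply(groups, function(g) {
  as.data.frame(lapply(g, summarise_col), stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)

# Sort like the dplyr output and drop the split-generated row names
result <- result[order(result$ag, result$date), ]
rownames(result) <- NULL
result
```

The grouping columns go through the same helper, which is harmless here because they are constant within each group.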
I have data that looks like the sample below, which can be created with the following code:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
I want to build a variable "Event" to capture all events. The final results will look like this:
What should I do? I would like to know as many ways as possible. Thanks.
One option is to use apply(), as shown below. The suggestion from @AllanCameron is also a great choice. Here is the code:
#Vectors
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
#Data
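Sample.data <- data.frame(ID, Days, Event_P, Event_N, Event_C, stringsAsFactors = FALSE)
#Option 1
# Positions of the columns whose names contain 'Event'
index <- grep('Event', names(Sample.data))
# For each row, keep the non-empty event codes and join them with '/'
Sample.data$Event <- apply(Sample.data[, index], 1, function(x) paste0(x[x != ''], collapse = '/'))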
Output:
ID Days Event_P Event_N Event_C Event
1 1 -5 C C
2 1 1
3 1 18 P C P/C
4 1 30
5 2 1 N N
6 2 8
7 2 16 P N C P/N/C
8 3 1
9 3 8
10 3 6 P N C P/N/C
11 4 -6 N N
12 4 1
13 4 7 P N P/N
14 4 15 P N P/N
Duck's answer is very good, but you mentioned you want as many ways as possible, so here are two more:
You could use tidyverse's mutate with paste, or base R's interaction, to combine the columns and then use gsub to clear out the leftover separators:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
library(tidyverse)
Sample.data %>%
mutate(Event = paste(Event_P, Event_N, Event_C, sep='/'),
Event = gsub('^/|^//|/$|//$', '', Event),
Event = gsub('//', '/', Event))
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Sample.data$Event <-
interaction(Sample.data$Event_P, Sample.data$Event_N, Sample.data$Event_C, sep = '/') %>%
gsub('^/|^//|/$|//$', '', .) %>%
gsub('//', '/', .)
Sample.data
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Created on 2020-09-18 by the reprex package (v0.3.0)
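What the patterns inside gsub('^/|^//|/$|//$', '', ...) do:
^/|^//: remove a / or // at the start of the string
/$|//$: remove a / or // at the end of the string
The second gsub('//', '/', ...) then collapses any // left in the middle of the string into a single /.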
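As a possible simplification, the patterns in the two gsub calls above can be written with the + quantifier, which matches a run of one or more slashes regardless of how many event columns are pasted together. This is a sketch in base R only; the standalone vector Event is used here instead of a data frame column to keep the example short.

```r
Event_P <- c("", "", "P", "", "", "", "P", "", "", "P", "", "", "P", "P")
Event_N <- c("", "", "", "", "N", "", "N", "", "", "N", "N", "", "N", "N")
Event_C <- c("C", "", "C", "", "", "", "C", "", "", "C", "", "", "", "")

# Join the three columns, e.g. "" "" "C" becomes "//C"
Event <- paste(Event_P, Event_N, Event_C, sep = "/")

# Strip any run of slashes at the start or end of the string
Event <- gsub("^/+|/+$", "", Event)

# Collapse any remaining run of slashes in the middle to a single one
Event <- gsub("/+", "/", Event)

Event
```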
Say I have a data frame like the following:
> mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
> mydf
a b
1 A 1
2 B 2
3 C 3
4 D/E 4/5
5 F 6
6 G/H 7/8
7 I/J 9/10
8 K 11
9 L 12
How do I make it look like the following, with an easy one-liner (preferably base)? Thanks
> mydf2
a b
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
11 K 11
12 L 12
You can use separate_rows() from the tidyr package:
library(tidyr)
mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
mydf
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D/E 4/5
#> 5 F 6
#> 6 G/H 7/8
#> 7 I/J 9/10
#> 8 K 11
#> 9 L 12
separate_rows(mydf, a, b, convert = TRUE)
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5
#> 6 F 6
#> 7 G 7
#> 8 H 8
#> 9 I 9
#> 10 J 10
#> 11 K 11
#> 12 L 12
Created on 2018-04-18 by the reprex package (v0.2.0).
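Since the question asked for base R if possible, here is a rough base sketch using strsplit(). It assumes that a and b always split into the same number of pieces in every row (checked with stopifnot), and that b holds only numbers once split; a_parts and b_parts are illustrative names.

```r
mydf <- data.frame(a = c('A', 'B', 'C', 'D/E', 'F', 'G/H', 'I/J', 'K', 'L'),
                   b = c(1, 2, 3, '4/5', 6, '7/8', '9/10', 11, 12),
                   stringsAsFactors = FALSE)

# Split each cell on the literal "/" (fixed = TRUE avoids regex interpretation)
a_parts <- strsplit(mydf$a, "/", fixed = TRUE)
b_parts <- strsplit(mydf$b, "/", fixed = TRUE)

# Assumption: both columns yield the same number of pieces in every row
stopifnot(lengths(a_parts) == lengths(b_parts))

# Flatten the lists; the pieces line up row by row
mydf2 <- data.frame(a = unlist(a_parts),
                    b = as.numeric(unlist(b_parts)),
                    stringsAsFactors = FALSE)
mydf2
```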
The problem I have is as explained in the title. I want to randomize the top, middle and bottom 3 rows in place. Here is a sample dataframe.
> set.seed(7)
> mydf
Id Name Score Feedback
1 1 AB 11 P
2 2 AA 12 P
3 3 AC 12 P
4 4 AD 31 P
5 5 AE 13 P
6 6 AF 15 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
I could take the top, middle and last 3 rows independently and do a randomization and merge them back as follows:
# Take consecutive sets of 3 rows from mydf
top3 <- head(mydf,3)
middle3 <- mydf[4:6,]
tail3 <- tail(mydf,3)
# randomize the rows
top3r <- top3[sample(nrow(top3)),]
middle3r <- middle3[sample(nrow(middle3)),]
tail3r <- tail3[sample(nrow(tail3)),]
# merge them back
mydfr <- rbind(top3r, middle3r, tail3r)
> mydfr
Id Name Score Feedback
2 2 AA 12 P
1 1 AB 11 P
3 3 AC 12 P
6 6 AF 15 P
4 4 AD 31 P
5 5 AE 13 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
Is there some way I could achieve the same result without manually pulling out each set of n rows?
Thank you.
This is basically the same as your code, but without all the intermediate variables.
mydf[c(sample(1:3), sample(4:6), sample(7:9)), ]
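If the number of rows or the block size varies, the same idea can be wrapped in a small helper. This is a sketch only: shuffle_within_blocks and block_size are hypothetical names, and mydf is rebuilt here from the table printed in the question. The helper permutes positions with sample.int() rather than calling sample() on the row numbers directly, because sample(x) with a single number x samples from 1:x, which would go wrong for a trailing block of one row.

```r
mydf <- data.frame(Id = 1:9,
                   Name = c("AB", "AA", "AC", "AD", "AE", "AF", "AG", "AH", "AI"),
                   Score = c(11, 12, 12, 31, 13, 15, 9, 8, 11),
                   Feedback = c("P", "P", "P", "P", "P", "P", "F", "F", "P"),
                   stringsAsFactors = FALSE)

shuffle_within_blocks <- function(df, block_size) {
  rows <- seq_len(nrow(df))
  # Block number for each row: 1,1,1,2,2,2,... for block_size = 3
  block <- ceiling(rows / block_size)
  # Permute the row numbers inside each block, keeping the block order
  idx <- unlist(lapply(split(rows, block),
                       function(i) i[sample.int(length(i))]),
                use.names = FALSE)
  df[idx, , drop = FALSE]
}

set.seed(7)
shuffled <- shuffle_within_blocks(mydf, 3)
shuffled
```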
Here is a way it could be done if you wanted to use dplyr (I do like the base solution by @Gregor in the comments though).
library(dplyr)
set.seed(1)
mydf %>%
mutate(grp = rep(1:3, each = 3)) %>%
group_by(grp) %>%
sample_n(3)
#> # A tibble: 9 x 5
#> # Groups: grp [3]
#> Id Name Score Feedback grp
#> <int> <chr> <int> <chr> <int>
#> 1 1 AB 11 P 1
#> 2 3 AC 12 P 1
#> 3 2 AA 12 P 1
#> 4 6 AF 15 P 2
#> 5 4 AD 31 P 2
#> 6 5 AE 13 P 2
#> 7 9 AI 11 P 3
#> 8 8 AH 8 F 3
#> 9 7 AG 9 F 3