This question already has answers here:
Get last row of each group in R [duplicate]
(4 answers)
Closed 2 years ago.
Hello I have a df such as
COL1 COL2 COL3           COL4
NA   NA   Sp_canis_lupus 10
3    8    Sp_canis_lupus 10
3    8    Sp_canis_lupus 10
How can I remove duplicate rows in COL3 and keep the last row ?
Here I should get :
COL1 COL2 COL3           COL4
3    8    Sp_canis_lupus 10
Thank you very much for your help
You could also solve this with aggregate, like below:
aggregate(. ~ COL3, data = df, FUN = tail, 1)
Or another way in dplyr:
library(dplyr)
df %>%
group_by(COL3) %>%
slice(n())
This assumes that you only want to remove duplicates within COL3. If duplicates should be judged on other columns as well, the grouping has to change; the example is too simple to tell.
Using dplyr:
df %>%
group_by(COL3) %>%
filter(row_number() == n())
Upvote if it helps thanks!
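Use duplicated to find duplicates, then keep the rows that are not flagged, i.e. x[!duplicated(x), ]. By default this compares whole rows and keeps the first occurrence, so to keep the last row per COL3 value you need to test that column with fromLast = TRUE, i.e. df[!duplicated(df$COL3, fromLast = TRUE), ]. Comparing on COL3 alone also sidesteps the NAs in the other columns.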
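For the data in the question, the duplicated() approach might look like the sketch below. The data frame is rebuilt by hand from the question's example, and last_rows is just an illustrative name.

```r
# Example data rebuilt from the question
df <- data.frame(
  COL1 = c(NA, 3, 3),
  COL2 = c(NA, 8, 8),
  COL3 = "Sp_canis_lupus",
  COL4 = 10
)

# With fromLast = TRUE, duplicated() flags every row except the last one
# for each COL3 value, so negating it keeps only that last row
last_rows <- df[!duplicated(df$COL3, fromLast = TRUE), ]
last_rows
```

Without fromLast = TRUE the same call would keep the first row of each group instead, which here is the one containing the NAs.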
This question already has answers here:
Is there an R function for finding the index of an element in a vector?
(4 answers)
Closed 1 year ago.
I have a column of random words, and some words contain the months (it could be anything such as Jan, or Dec), I want to be able to find those row numbers with months name. How can I do that?
df = tibble(word=c("asd","May","jbsd"))
grepl(c("Jan","Feb","Mar","Apr","May", etc), df[["word"]])
You can use which with %in% to get the row number of the match.
which(df$word %in% month.abb)
#[1] 2
Note that month.abb holds the English three-letter abbreviations regardless of your system locale, so this only works if df$word uses English month names.
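%in% only matches words that are exactly a month abbreviation. If the words can merely contain one, as the question suggests, a pattern match is one option. The following is a minimal sketch; the extra word "Mayday" is a made-up entry added to show a partial match, and the match is case-sensitive.

```r
# Example words; "Mayday" is a hypothetical entry that only contains "May"
word <- c("asd", "May", "jbsd", "Mayday")

# Build a single alternation pattern: "Jan|Feb|...|Dec"
pattern <- paste(month.abb, collapse = "|")

# Row numbers of the words that contain any month abbreviation
hits <- which(grepl(pattern, word))
hits
```

This also addresses the attempt in the question: grepl() expects a single pattern, so the month names have to be collapsed into one regular expression first.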
Edit: I just saw the comments; I'll leave my answer here in case you also want to see the month name.
Using dplyr:
df %>%
mutate(rownumber = row_number()) %>%
filter(word %in% month.abb)
Output:
# A tibble: 1 x 2
  word  rownumber
  <chr>     <int>
1 May           2
This question already has answers here:
Extract the first 2 Characters in a string
(4 answers)
Closed 2 years ago.
So I'm needing to rename a column in R and from that column I need to condense the column. For example in the initial data frame it would say "2017-18" and "2018-19" and I need it to condense to the first four digits, essentially cutting off the "-##" portion. I've attempted to use substr() and when I do it says that I'm having issues with it converting to characters or attempting to convert to a character.
data <- read_excel("nba.xlsx")
data1<- data %>%
rename(year=season) %>%
select(year)
data1 <- data1 + as.numeric(substr(year,1,4))
Above is the code I currently have; I have tried rearranging it and moving things around. Any help would be greatly appreciated. Thank you!
Use str_replace:
library(dplyr)
library(stringr)
df <- tibble(season = c("2017-18", "2018-19"))
df %>% mutate(year = str_replace(season, "-.*", ""))
# A tibble: 2 x 2
  season  year
  <chr>   <chr>
1 2017-18 2017
2 2018-19 2018
Alternatively, use str_sub:
df %>% mutate(year = str_sub(season, 1, 4))
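The substr() attempt in the question can also be made to work in base R. The sketch below assumes the problem was that year was referenced on its own, outside the data frame, and that the result was added to data1 with + instead of being assigned to the column. The two-row data1 is a stand-in for the Excel data.

```r
# Stand-in for the renamed column from the question
data1 <- data.frame(year = c("2017-18", "2018-19"))

# Reach the column through the data frame, keep the first four characters,
# then convert the result to numeric and assign it back to the column
data1$year <- as.numeric(substr(data1$year, 1, 4))
data1$year
```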
This question already has answers here:
Duplicated rows: select rows based on criteria and store duplicated values
(2 answers)
Closed 3 years ago.
I have a large data set with text comments and their ratings on different variables, like so:
df <- data.frame(
comment = c("commentA","commentB","commentB","commentA","commentA","commentC"),
sentiment=c(1,2,1,4,1,2),
tone=c(1,5,3,2,6,1)
)
Every comment is present between one and three times, since multiple people are sometimes asked to rate the same comment.
I'm looking to create a data frame where the "comment" column only has unique values, and the other columns are appended, so any one text comment has as many "sentiment" and "tone" columns as there are ratings (which will result in NA's for comments that have not been rated as often, but that's okay):
df <- data.frame(
comment = c("commentA","commentB","commentC"),
sentiment.1=c(1,2,2),
sentiment.2=c(4,1,NA),
sentiment.3=c(1,NA,NA),
tone.1=c(1,5,1),
tone.2=c(2,3,NA),
tone.3=c(6,NA,NA)
)
I've been trying to figure this out using reshape to go from long to wide using
reshape(df,
idvar = "comment",
timevar = c("sentiment","tone"),
direction = "wide"
)
But that results in all possible combinations between sentiment and tone, rather than simply duplicating sentiment and tone independently.
I also tried using gather like so df %>% gather(key, value, -comment), but that only gets me halfway there...
Could anyone please point me in the right direction?
You need to create a variable to use as the numbers in the columns. rowid(comment) does the trick.
In dcast you put the row identifiers to the left of ~ and the column identifiers to the right. Then value.var is a character vector of all the columns you want to include in this long-to-wide transformation.
library(data.table)
setDT(df)
dcast(df, comment ~ rowid(comment), value.var = c('sentiment', 'tone'))
#     comment sentiment_1 sentiment_2 sentiment_3 tone_1 tone_2 tone_3
# 1: commentA           1           4           1      1      2      6
# 2: commentB           2           1          NA      5      3     NA
# 3: commentC           2          NA          NA      1     NA     NA
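The same idea, a per-comment counter used as the column identifier, also seems to work with base reshape(), which is what the question started from. This is a sketch on the question's data; the helper column name rating is made up for illustration.

```r
df <- data.frame(
  comment   = c("commentA", "commentB", "commentB", "commentA", "commentA", "commentC"),
  sentiment = c(1, 2, 1, 4, 1, 2),
  tone      = c(1, 5, 3, 2, 6, 1)
)

# Running counter within each comment: 1 for its first rating, 2 for its second, ...
df$rating <- ave(seq_along(df$comment), df$comment, FUN = seq_along)

# Using the counter as the single timevar spreads sentiment and tone
# independently instead of crossing them with each other
wide <- reshape(df, idvar = "comment", timevar = "rating", direction = "wide")
wide
```

The resulting columns are named sentiment.1, tone.1, sentiment.2 and so on, which matches the naming in the desired output, although the column order differs.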
This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I want to filter the data frame to remove rows whose names in col0 occur more than once. If two or more rows share a similar name, I want to keep the row with the highest value in col1.
col0                col1     col2      col3         col4         col4      col5
hsa-let-7a-5p   2.487304 15.04636  8.400422 1.702870e-10 1.084728e-07 13.867065
hsa-let-7a-5p   2.491626 13.70345  7.414093 4.002913e-09 1.274928e-06 10.808433
hsa-let-7d-5p   3.074776 11.36059  6.799401 2.977052e-08 6.321274e-06  8.887774
hsa-miR-7d-5p   3.123776 11.84145  6.210222 2.069015e-07 3.050719e-05  7.032421
hsa-miR-122-5p -2.521427 13.91681 -6.132486 2.673240e-07 3.050719e-05  6.703794
hsa-miR-122-5p  2.602304 11.53867  6.083099 3.145797e-07 3.050719e-05  6.636385
In my example I want to keep row2,row4 and row6.
Any tips on function?
A data.frame cannot have duplicated row names, so the names must either be the row names of a matrix or the first column of a data.frame. Assuming the latter, group by the first column, i.e. 'col0', and slice the row with the maximum value in 'col1':
library(dplyr)
df1 %>%
group_by(col0) %>%
slice(which.max(col1))
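A base R variant of the same idea is sketched below: sort by col1 in decreasing order, then keep the first row seen for each col0. The small df1 uses only two columns and four of the rows from the question; sorted and best are illustrative names.

```r
# Reduced example with values taken from the question
df1 <- data.frame(
  col0 = c("hsa-let-7a-5p", "hsa-let-7a-5p", "hsa-miR-122-5p", "hsa-miR-122-5p"),
  col1 = c(2.487304, 2.491626, -2.521427, 2.602304)
)

# Sort so that the largest col1 comes first
sorted <- df1[order(df1$col1, decreasing = TRUE), ]

# duplicated() keeps the first occurrence, which is now the maximum per col0
best <- sorted[!duplicated(sorted$col0), ]
best
```

Like the dplyr version, this matches names exactly, so two different names that only look similar are treated as separate groups.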
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I want to generate a df that selects rows associated with an "ID" that in turn is associated with a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 3 or a 4 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests whether each df1$ID value belongs to a category with more than 9 rows. Where it does, the corresponding element of the resulting logical vector is TRUE. Passing that vector as the "i" argument makes the [ function return the whole matching rows, since the "j" argument is left empty.
See:
?`[`
?'%in%'
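As for what the original last line is doing, the likely explanation is recycling. The sketch below assumes that IDfreq$freq holds one count per distinct ID (5 values), as plyr::count would return. The counts are rebuilt here with table(), and keep and recycled are illustrative names.

```r
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID)
cutoff <- 9

# One logical value per distinct ID (5 values), not one per row (45 rows)
keep <- as.vector(table(df1$ID)) > cutoff
keep

# As a row index, the short logical vector is recycled along the 45 rows,
# so rows 4, 5, 9, 10, 14, 15, ... are kept regardless of their ID
recycled <- df1[rep(keep, length.out = nrow(df1)), , drop = FALSE]
nrow(recycled)
```

The condition therefore has to be a vector with one element per row of df1, which is what the %in% test above provides.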
Using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)