Removing columns based on segment of column names in R

I have a dataframe that has multiple columns (close to 100) I don't need that have "CNT" in the middle. Below is a short example:
id drink drink_CNT_v2 sage_CNT_v5
1 12 23 12
2 14 32 13
3 15 12 12
4 16 12 43
5 20 50 23
I want to remove all variables that have CNT in the middle. Does anyone know how I could do that? I tried using mutate in tidyverse, but that didn't work.

We could use contains() inside select():
library(dplyr)
df2 <- df1 %>%
select(-contains("_CNT_"))
-output
df2
id drink
1 1 12
2 2 14
3 3 15
4 4 16
5 5 20
data
df1 <- structure(list(id = 1:5, drink = c(12L, 14L, 15L, 16L, 20L),
drink_CNT_v2 = c(23L, 32L, 12L, 12L, 50L), sage_CNT_v5 = c(12L,
13L, 12L, 43L, 23L)), class = "data.frame", row.names = c(NA,
-5L))

In base R, with grepl:
df[!grepl("CNT", colnames(df))]
Also works with select (use grep):
df %>%
select(-grep("CNT", names(.)))
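Note that the pattern "CNT" matches those letters anywhere in a column name, while "_CNT_" only matches when underscores surround them. A minimal base R sketch of the difference, assuming a hypothetical extra column CNT_total that is not part of the original data:

```r
# Toy data based on the question; CNT_total is a hypothetical column
# added only to show how the two patterns differ.
df <- data.frame(
  id = 1:3,
  drink = c(12L, 14L, 15L),
  drink_CNT_v2 = c(23L, 32L, 12L),
  sage_CNT_v5 = c(12L, 13L, 12L),
  CNT_total = c(35L, 45L, 24L)
)

# Loose: drop every column whose name contains "CNT" anywhere.
loose <- df[!grepl("CNT", names(df), fixed = TRUE)]

# Strict: drop only columns where "_CNT_" has at least one character
# on each side, i.e. it sits in the middle of the name.
strict <- df[!grepl("._CNT_.", names(df))]

names(loose)   # "id" "drink"
names(strict)  # "id" "drink" "CNT_total"
```

Which pattern is appropriate depends on whether names that merely start or end with CNT should also be dropped.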

Related

Assign value in vector based on presence in another vector in R?

I have tried to look for a similar question, and I'm sure other people have encountered this problem, but I still couldn't find something that helped me. I have a dataset1 with 37,000 observations like this:
id hours
130 12
165 56
250 13
11 15
17 42
and another dataset2 with 38,000 observations like this:
id hours
130 6
165 23
250 9
11 14
17 11
I want to do the following: if an id of dataset1 is in dataset2, the hours of dataset1 should override the hours of dataset2. For the ids that are in dataset1 but not in dataset2, the value for dataset2$hours should be NA.
I tried the %in% operator, ifelse(), a loop, and some base R commands, but I can't figure it out. I always get the error that the vectors don't have the same length.
Thanks for any help!
You can replace hours with NA for every id in df1 that has no match in df2. Since both of your data sets had the same ids, I added one row to df1 with id = 123 and hours = 12.
df1$hours <- replace(df1$hours, is.na(match(df1$id,df2$id)), NA)
df1
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 NA
data
df1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 123L), hours = c(12L,
56L, 13L, 15L, 42L, 12L)), row.names = c(NA, -6L), class = "data.frame")
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 12
df2 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L), hours = c(6L,
23L, 9L, 14L, 11L)), class = "data.frame", row.names = c(NA,
-5L))
First, match the IDs of the replacement data (dat2) against the IDs of the original data (dat1), wrapping match() in na.omit() to drop replacement IDs that are not contained in the original data. Then assign the hours of those replacement rows whose IDs do occur in the original data.
I expanded both data sets to fabricate cases with no matches.
dat1
# id hours
# 1 130 12
# 2 165 56
# 3 250 13
# 4 11 15
# 5 17 42
# 6 12 232
# 7 35 456
dat2
# id hours
# 1 11 14
# 2 17 11
# 3 165 23
# 4 999 99
# 5 130 6
# 6 250 9
Replacement
dat1[na.omit(match(dat2$id, dat1$id)), ]$hours <-
dat2[dat2$id %in% dat1$id, ]$hours
dat1
# id hours
# 1 130 6
# 2 165 23
# 3 250 9
# 4 11 14
# 5 17 11
# 6 12 232
# 7 35 456
Data:
dat1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 12L, 35L),
hours = c(12L, 56L, 13L, 15L, 42L, 232L, 456L)), class = "data.frame", row.names = c(NA,
-7L))
dat2 <- structure(list(id = c(11L, 17L, 165L, 999L, 130L, 250L), hours = c(14L,
11L, 23L, 99L, 6L, 9L)), class = "data.frame", row.names = c(NA,
-6L))
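For the other direction, here is a minimal sketch under one reading of the question: hours in dataset2 are taken from dataset1 wherever the id is found there, and set to NA otherwise. The id 999 is a hypothetical unmatched value added for illustration. Because match() returns one position per element of its first argument, the result always has the same length as dataset2, which avoids the length-mismatch error mentioned in the question.

```r
dataset1 <- data.frame(
  id = c(130L, 165L, 250L, 11L, 17L),
  hours = c(12L, 56L, 13L, 15L, 42L)
)
# id 999 is hypothetical: it does not occur in dataset1.
dataset2 <- data.frame(
  id = c(130L, 165L, 999L, 11L, 17L),
  hours = c(6L, 23L, 9L, 14L, 11L)
)

# Position of each dataset2 id within dataset1 (NA when absent).
pos <- match(dataset2$id, dataset1$id)

# Indexing with NA yields NA, so unmatched ids get NA automatically.
dataset2$hours <- dataset1$hours[pos]
dataset2$hours  # 12 56 NA 15 42
```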

Change row name returns "duplicate 'row.names' are not allowed" in R

I've tried to change the row names from the format "data07_2470178_2" to "2470178" with the following code:
rownames(df) <-regmatches(rownames(df), gregexpr("(?<=_)[[:alnum:]]{7}", rownames(df), perl = TRUE))
But it returns the following error:
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
The dataset briefly looks like this:
1 2 3 4
data143_2220020_1 24 87 3 32
data143_2220020_2 24 87 3 32
data105_2220058_1 26 91 3 36
data105_2220058_2 26 91 3 36
data134_2221056_2 13 40 3 17
data134_2221056_1 13 40 3 17
And I'd like my dataset to look like this. For every pair of original rows, only the one ending with "_2" should remain:
1 2 3 4
2220020 24 87 3 32
2220058 26 91 3 36
2221056 13 40 3 17
I really don't understand why this happens. Also, how can I change the row names correctly? Could anyone help? Thanks in advance!
If you want to remove rows based on row names, you can use:
rn <- sub('.*_(\\d+)_.*', '\\1', rownames(df))
df1 <- df[!duplicated(rn), ]
rownames(df1) <- unique(rn)
df1
# 1 2 3 4
#2220020 24 87 3 32
#2220058 26 91 3 36
#2221056 13 40 3 17
However, unique(df) would automatically give you only the unique rows, and you can then change the row names with the method above.
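The error in the question arises because each extracted ID appears twice (once for the "_1" row and once for the "_2" row), and data frame row names must be unique. To keep specifically the "_2" rows, as the question asks, one possible sketch is to filter on the suffix first and rename afterwards. The column names v1 and v2 and the object names keep and out are illustrative, not from the original post.

```r
# Shortened version of the example data (two columns instead of four).
df <- data.frame(
  v1 = c(24L, 24L, 26L, 26L, 13L, 13L),
  v2 = c(87L, 87L, 91L, 91L, 40L, 40L),
  row.names = c("data143_2220020_1", "data143_2220020_2",
                "data105_2220058_1", "data105_2220058_2",
                "data134_2221056_2", "data134_2221056_1")
)

# Keep only rows whose name ends in "_2".
keep <- grepl("_2$", rownames(df))
out <- df[keep, , drop = FALSE]

# Reduce each remaining name to the part between the two underscores.
rownames(out) <- sub("^[^_]*_([^_]+)_.*$", "\\1", rownames(out))
rownames(out)  # "2220020" "2220058" "2221056"
```

Filtering before renaming means the new names are already unique at the point they are assigned.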
data
df <- structure(list(`1` = c(24L, 24L, 26L, 26L, 13L, 13L), `2` = c(87L,
87L, 91L, 91L, 40L, 40L), `3` = c(3L, 3L, 3L, 3L, 3L, 3L), `4` = c(32L,
32L, 36L, 36L, 17L, 17L)), class = "data.frame",
row.names = c("data143_2220020_1",
"data143_2220020_2", "data105_2220058_1", "data105_2220058_2",
"data134_2221056_2", "data134_2221056_1"))

Create a new data frame column that is a combination of other columns

I have 3 columns a, b, c and I want to combine them into a new column, using the column mode as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index (the 'mode' column) are combined with cbind into a two-column matrix, which extracts the corresponding values from the subset of the dataset (columns 2 to 4):
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
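To make the matrix-indexing step more concrete, the following sketch builds the index matrix explicitly. Each row of idx is a (row, column) pair, and indexing a matrix with it returns one value per pair. The names idx, vals_mat and combine are illustrative.

```r
df1 <- data.frame(
  mode = c(1L, 1L, 3L, 2L, 1L),
  a = c(2L, 5L, 2L, 12L, 20L),
  b = c(3L, 53L, 31L, 13L, 30L),
  c = c(4L, 14L, 24L, 44L, 40L)
)

# One (row, column) pair per data row: row i, column given by mode[i].
idx <- cbind(seq_len(nrow(df1)), df1$mode)

# Convert the value columns to a matrix so that column 1 = a, 2 = b, 3 = c.
vals_mat <- as.matrix(df1[c("a", "b", "c")])

# Indexing with a two-column matrix picks exactly one cell per pair.
combine <- vals_mat[idx]
combine  # 2 5 24 13 20
```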
Another solution in base R that works by converting "mode" to letters then extracting those values in the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else():
library(dplyr)
df1 %>%
mutate(combine =
if_else(mode == 1, a,
if_else(mode == 2, b, c)
)
)
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)

Finding columns that contain values based on another column

I have the following data frame:
Step 1 2 3
1 5 10 6
2 5 11 5
3 5 13 9
4 5 15 10
5 13 18 10
6 15 20 10
7 17 23 10
8 19 25 10
9 21 27 13
10 23 30 7
I would like to retrieve the columns that satisfy one of the following conditions: if step 1 = step 4 or step 4 = step 8. In this case, column 1 and 3 should be retrieved. Column 1 because the value at Step 1 = value at step 4 (i.e., 5), and for column 3, the value at step 4 = value at step 8 (i.e., 10).
I don't know how to do that in R. Can someone help me please?
The following code returns a logical matrix that flags the columns satisfying either condition:
df[1, -1] == df[4, -1] | df[4, -1] == df[8, -1]
# X1 X2 X3
# 1 TRUE FALSE TRUE
# data
df <- structure(list(Step = 1:10, X1 = c(5L, 5L, 5L, 5L, 13L, 15L,
17L, 19L, 21L, 23L), X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L,
25L, 27L, 30L), X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L,
7L)), class = "data.frame", row.names = c(NA, -10L))
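To go from the logical result to the actual columns, one option is to work with named vectors and use them to subset. This is a sketch, not the answerer's code; the helper step_values is hypothetical and looks rows up by the value of Step rather than by row position, which only matters if the rows are not in Step order.

```r
df <- data.frame(
  Step = 1:10,
  X1 = c(5L, 5L, 5L, 5L, 13L, 15L, 17L, 19L, 21L, 23L),
  X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L, 25L, 27L, 30L),
  X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L, 7L)
)

# Hypothetical helper: the values of all non-Step columns at a given step,
# as a named vector.
step_values <- function(s) unlist(df[df$Step == s, -1])

# TRUE for columns where step 1 equals step 4, or step 4 equals step 8.
hit <- step_values(1) == step_values(4) | step_values(4) == step_values(8)

cols <- names(hit)[hit]
cols  # "X1" "X3"

# Keep Step plus the columns that satisfy a condition.
selected <- df[, c("Step", cols)]
```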

How to convert intervals with value to individual position level in R

I struggle a bit with following problem:
I have table A (below) and I would like to convert the intervals defined there into individual positions, as in table B. For each position, the value should be the sum of the values of all intervals in table A that overlap it (an interval runs from its Start to its End), just the interval's own value if only one interval covers the position, or 0 if no interval covers it. I would prefer a solution in R. I would really appreciate your help.
Table A
ID Start End Value
1 1 5 9
2 3 7 5
3 5 9 13
4 11 15 1
5 12 16 18
6 14 18 21
Convert to this Table B
Position Value
1 9
2 9
3 14
4 14
5 27
6 18
7 18
8 13
9 13
10 0
11 15
12 33
13 33
14 54
15 54
16 39
17 21
18 21
Not a very straightforward way, but it gets the job done:
df<-structure(list(ID = 1:6, Start = c(1L, 3L, 5L, 11L, 12L, 14L),
End = c(5L, 7L, 9L, 15L, 16L, 18L),
Value = c(9L, 5L, 13L, 1L, 18L, 21L)), .Names = c("ID", "Start", "End", "Value"),
class = "data.frame", row.names = c(NA,
-6L))
# build one two-column matrix (position, value) per interval
s1<-lapply(seq_len(nrow(df)), function(i) {matrix(c(df[i,2]:df[i,3], rep(df[i,4], (df[i,3]-df[i,2]+1))), nrow = (df[i,3]-df[i,2])+1)})
s2<-as.data.frame(do.call(rbind, s1))
# sum the values of all rows that share a position
library(dplyr)
wgaps<-summarise(group_by(s2, V1), sum(V2))
# create a sequence of positions with no gaps and join the sums onto it
nogaps<-data.frame(Position=seq(min(wgaps$V1), max(wgaps$V1)))
nogaps<-left_join(nogaps, wgaps, by=c("Position"= "V1"))
names(nogaps)<-c("Position", "Value") # rename to match Table B
nogaps$Value[is.na(nogaps$Value)]<-0 # positions covered by no interval get 0
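As a dependency-free alternative, the same per-position sums can be accumulated in a plain vector. This is a minimal base R sketch using the data from this answer; the names positions, total, idx and result are illustrative.

```r
df <- data.frame(
  ID = 1:6,
  Start = c(1L, 3L, 5L, 11L, 12L, 14L),
  End = c(5L, 7L, 9L, 15L, 16L, 18L),
  Value = c(9L, 5L, 13L, 1L, 18L, 21L)
)

# Every position from the first Start to the last End, gaps included.
positions <- seq(min(df$Start), max(df$End))
total <- numeric(length(positions))  # starts at 0 everywhere

# Add each interval's value to all positions it covers.
offset <- min(df$Start) - 1
for (i in seq_len(nrow(df))) {
  idx <- (df$Start[i]:df$End[i]) - offset
  total[idx] <- total[idx] + df$Value[i]
}

result <- data.frame(Position = positions, Value = total)
```

Positions that no interval covers keep their initial 0, so no separate gap-filling step is needed. With the data as given in this answer (Value = 1 for ID 4), positions 11 to 15 come out as 1, 19, 19, 40, 40; Table B in the question would be reproduced only if that value were 15.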
