Select columns from a data frame whose names start with a number - R

I have a data frame in which some column names start with numbers and others with letters, and I want to subset the columns whose names start with a number followed by a dot.
This code works on the sample below, but in my actual data frame the column AA ID also gets selected, and I don't know why.
df <- data.frame(`AA ID` = c(1,2,3,4,5,6,7,8,9,10),
                 "BB" = c("AMK","KAMl","HAJ","NHS","KUL","GAF","BGA","NHU","VGY","NHU"),
                 "CC" = c("TAMAN","GHUSI","KELVIN","DEREK","LOKU","MNDHUL","JASMIN","BINNY","BURTAM","DAVID"),
                 "DD" = c(62,41,37,41,32,74,52,75,59,36),
                 "EE" = c("CA","NY","GA","DE","MN","LA","GA","VA","TM","BA"),
                 "FF" = c("ENGLISH","FRENCH","ENGLISH","FRENCH","ENGLISH","ENGLISH","SPANISH","ENGLISH","SPANISH","RUSSIAN"),
                 "GG" = c(33,44,51,51,37,58,24,67,41,75),
                 `1A` = c("","D","","NA","","D","","","D",""),
                 `2B` = c("","A","","","A","A","A","A","",""),
                 `3C` = c("","","","","","","","","",""),
                 `4D` = c("","G","G","G","G","G","G","G","",""),
                 "Concatenate" = c("","DAG","G","NAG","AG","DAG","AG","AG","D",""))
df <- df %>% rename(`1. A` = "X1A", `2. B` = "X2B", `3. C` = "X3C", `4. D` = "X4D")
Error_summary <- select(df,matches("^[0-9]*\\."))
I am also trying to add per-column counts to a data frame, like below:
df_row <- df %>%
  summarize(across(c(matches("^[0-9]*\\."), Concatenate), ~ sum(!is.na(.) & . != "" & . != "NA")))
but this also selects the column AA ID, which I don't want.

Taking into account that your variables that are supposed to start with numbers will be converted by data.frame() (with its default check.names = TRUE) to variable names starting with X, you could do:
library(tidyverse)
df %>%
  select(matches("^X[0-9]"))
which gives:
   X1..A X2..B X3..C X4..D
1
2      D     A           G
3                        G
4     NA                 G
5            A           G
6      D     A           G
7            A           G
8            A           G
9      D
10
With the same logic you can do your counts:
df %>%
  summarize(across(c(matches("^X[0-9]"), Concatenate), ~ sum(!is.na(.) & . != "" & . != "NA")))
which gives:
  X1..A X2..B X3..C X4..D Concatenate
1     3     5     0     7           8
Although I'm not sure if you want to exclude the "NAG" value in the Concatenate column.
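As for why AA ID is selected in the real data: a plausible cause (an assumption, since the real column names aren't shown) is that the * allows zero digits, so "^[0-9]*\\." matches any name that begins with a dot, and a stray leading space or symbol in a name like ` AA ID` becomes a leading dot under check.names. Requiring at least one digit rules that out:
# hypothetical fix: require at least one digit before the dot
Error_summary <- select(df, matches("^[0-9]+\\."))
The same pattern can be reused inside the across()/matches() call for the counts.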

Related

Find duplicates in comma-separated string data using R

I am currently looping over multiple datasets and ended up with the results below. Each line corresponds to the output from one dataset after it has been filtered.
I am using this code to get the results below.
output <- print(paste(data.final$Peptide, collapse = ','))
The data was previously in the form of a table with a column named "Peptide", so I am pasting the peptides into a comma-separated string, as shown here:
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
I would like to count how many times each comma-separated string (e.g. LPPAYTNSF) is duplicated across the lines.
Is there any way to do this?
If I understand you correctly, there is no reason to first collapse the values into a string. Why not just get the counts from your data.frame? Any n > 1 indicates duplicates.
data.final %>% count(Peptide)
A solution based on your string:
table(unlist(strsplit(v, ",")))
Results:
ASFSTFKCY CVADYSVLY KIYSKHTPI LPFNDGVYF LPPAYTNSF LPSAYTNSF QSYGFQPTY RLFRKSNLK SANNCTFEY SAYTNSFTR TSNQVAVLY WMESEFRVY
        8         8         8         8         6         2         6         8         8         2         8         8
WTAGAAAYY YLQPRTFLL YNSASFSTF YSSANNCTF
        8         8         8         8
Data:
v <- "LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
Not quite sure what your data are like, so I've scribbled together some toy data to show how a regex and a non-regex tidyverse solution would work on your task of "find[ing] the number of duplicates from each comma-separated string":
Data:
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
"A,B,B,X,B,Q")
A regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
  # create a new column with the duplicated values:
  mutate(dups = str_extract_all(x, "([A-Za-z])+(?=.*\\1)"))
                        x    dups
1 a,c,x,a,f,s,w,s,b,n,x,q a, x, s
2             A,B,B,X,B,Q    B, B
NB: the + in the regex pattern makes sure you can also use this solution for comma-separated alphabetic strings longer than one character.
Alternatively, a non-regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
  # make a column with a unique ID for each string:
  mutate(stringID = row_number()) %>%
  # separate values into rows:
  separate_rows(x) %>%
  # for each combination of `stringID` and `x`...
  group_by(stringID, x) %>%
  # ...count the number of tokens:
  summarise(N = n()) %>%
  # show only the duplicated values:
  filter(N > 1)
# A tibble: 4 × 3
# Groups:   stringID [2]
  stringID x         N
     <int> <chr> <int>
1        1 a         2
2        1 s         2
3        1 x         2
4        2 B         3

Looping over multiple columns to generate a new variable based on a condition

I am trying to generate a new column (variable) based on the values in multiple columns.
I have over 60 columns in the dataset, and I wanted to subset the columns that I loop through.
The columns I use in my condition are all character, and when a certain pattern is matched, the new variable should get a value of 1.
I am using case_when because I need to run multiple conditions on each column to return a value.
CODE:
df <- read.csv("sample.csv")

# Generate the new variable
df$new_var <- 0

# Loop through columns 16 to 45
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
      case_when(
        grepl("I8501", df[[i]]) ~ 1
      ))
}
This does not work: when I table the results, I only get one value matched.
My other attempt was using:
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
      case_when(
        df[[i]] == "I8501" ~ 1
      ))
}
Are there any other ways to run through multiple columns with multiple conditions and change the value of the variable accordingly in R?
If I'm understanding what you want, I think you just need to specify another case in your case_when() for keeping the existing values when things don't match "I8501". This is how I would do that:
df$new_var <- 0
for (index in 16:45) {
  df <- df %>%
    mutate(
      new_var = case_when(
        grepl("I8501", df[[index]]) ~ 1,
        TRUE ~ df$new_var
      )
    )
}
I think a better way to do this though would be to use the ever useful apply():
has_match = apply(df[, 16:45], 1, function(x) sum(grepl("I8501", x)) > 0)
df$new_var = ifelse(has_match, 1, 0)
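A vectorized tidyverse variant avoids the loop entirely; this is a sketch, assuming dplyr >= 1.0.4 for if_any():
library(dplyr)
# flag rows where any of columns 16 to 45 contains "I8501"
df <- df %>%
  mutate(new_var = as.numeric(if_any(16:45, ~ grepl("I8501", .x))))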
Kindly check if this works for your file.
Sample df:
df <- data.frame(C1=c('A','B','C','D'),C2=c(1,7,3,4),C3=c(5,6,7,8))
> df
  C1 C2 C3
1  A  1  5
2  B  7  6
3  C  3  7
4  D  4  8
library(dplyr)
library(stringr)
df %>%
  rowwise() %>%
  # change 2:last_col() to select your column range, e.g. 2:5
  mutate(new_var = as.numeric(any(str_detect(c_across(2:last_col()), "7"))))
Output for finding "7" in any of the columns:
  C1       C2    C3 new_var
  <chr> <dbl> <dbl>   <dbl>
1 A         1     5       0
2 B         7     6       1
3 C         3     7       1
4 D         4     8       0

Grouping data in R by presence of string within lists of strings

I have a number of variables to group on: some are "normal" grouping variables (i.e. numeric/character) and some are lists of strings. What I want to do is determine a group based on matches between the "normal" variables and the presence of a match between any of the strings within these lists.
Example data:
dat <- data.frame(cod=c(1:5,1:5,3))
dat$lst <- c(list(c("a","a"),c("b"),c("x","x"),c("r","r"),c("t","t"),c("a"),c("e"),c("f","x"),c("e","q"),c("t"),c("f","f")))
dat <- dat %>% arrange(cod)
#Data
cod lst
1 a, a
1 a
2 b
2 e
3 x, x
3 f, x
3 f, f
4 r, r
4 e, q
5 t, t
5 t
So where lst contains a string that is present in several rows with the same cod, I want to create a group, like this:
# Desired output
cod lst grp
1 a, a 1
1 a 1
2 b 2
2 e 3
3 x, x 4
3 f, x 4
3 f, f 4
4 r, r 5
4 e, q 6
5 t, t 7
5 t 7
In grp 4, all three observations should be linked, as there are common list items shared between all three (i.e. one observation has an lst value of c("f","x"), which links c("f","f") and c("x","x") within cod group 3).
I tried to just create a logical TRUE/FALSE column that would show whether some cods should be grouped, via:
dat %>%
  group_by(cod) %>%
  mutate(grp = ifelse((lst %in% lst), TRUE, FALSE))
As well as via:
for (i in 1:dim(dat)[1]) {
  if (all(any(dat$cod[i] %in% dat$cod) & any(dat$lst[[i]] %in% unlist(dat$lst[-i])))) {
    dat$grp[i] <- TRUE
  } else {
    dat$grp[i] <- FALSE
  }
}
But nothing has been able to isolate unique groups so far. Any help greatly appreciated! Thanks!
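A sketch of one possible approach, assuming the intended grouping is "connected components of rows that share a cod and at least one lst element": treat each row as a node, connect overlapping rows within the same cod, and label the components with igraph. (The component numbers may not come out as 1-7 exactly, but the partition should match.)
library(igraph)

dat$id <- seq_len(nrow(dat))

# all row pairs; keep those with matching cod and overlapping lst
pairs <- combn(dat$id, 2, simplify = FALSE)
keep <- vapply(pairs, function(p) {
  dat$cod[p[1]] == dat$cod[p[2]] &&
    length(intersect(dat$lst[[p[1]]], dat$lst[[p[2]]])) > 0
}, logical(1))

# graph over all rows; unconnected rows stay their own component
edges <- as.data.frame(do.call(rbind, pairs[keep]))
g <- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = dat$id))
dat$grp <- components(g)$membership[as.character(dat$id)]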

Summarize Data String in R

I have a large dataset where some of the information needed is stored in the first column as a string separated by semicolons. For example:
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
Giving:
Information Data
1 Forrest;Trees;Unknown 5
2 Forrest;Trees;Leaves 1
3 Forrest;Trees;Trunks 3
4 Forrest;Shrubs;Unknown 4
5 Forrest;Shrubs;Branches 2
6 Forrest;Shrubs;Leaves 1
7 Forrest;Shrubs;NA 3
I need to simplify the names so that I only keep the last unique name that isn't "Unknown" or "NA", such that my data frame becomes:
Information Data
1 Trees;Unknown 5
2 Trees;Leaves 1
3 Trunks 3
4 Shrubs;Unknown 4
5 Branches 2
6 Shrubs;Leaves 1
7 Shrubs;NA 3
Maybe it's not the most efficient or elegant solution, but it works on the sample data. Hope it's also adequate for your needs:
library(stringr)
library(dplyr)
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)
# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())
# join back to original data
TestData <- TestData %>%
left_join(a)
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count), paste0(TestData$V4, ";", TestData$V5), TestData$V5)
Generally it is not recommended to put multiple variables in the same column, but dplyr and tidyr should give you what you want:
library(tidyr)
TestData_filtered <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), remove = FALSE) %>%
  filter(!grepl("Unknown|NA", BL)) %>%
  mutate(wanted = paste(TS, BL, sep = ";"))
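Note that the separate() + filter() approach drops the Unknown/NA rows rather than keeping them as "Trees;Unknown" etc. A sketch that reproduces the desired output exactly, keeping the last token on its own only when it is known and occurs in exactly one row:
library(stringr)

parts <- str_split_fixed(TestData$Information, ";", 3)
last  <- parts[, 3]
known <- !last %in% c("Unknown", "NA")
# a token stands alone only if it is known and not duplicated in another row
alone <- known & !(duplicated(last) | duplicated(last, fromLast = TRUE))
TestData$Information <- ifelse(alone, last, paste(parts[, 2], last, sep = ";"))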

Select rows in data frame with same values

I have a data frame with unique values in $Number identifying specific points where polygons intersect. Some points (e.g. 56) have 3 intersecting polygons. I want to extract the three rows whose $Number starts with 56.
df <- cbind(Number = rownames(check), check)
df
The issue going forward is that I will be applying this to 10,000 points and won't know the repeating numbers such as "56" in advance. So is there a general expression which chooses rows with a common match without knowing that value?
You can achieve the desired output with:
subset2 <- function(n) df[floor(df$Number) == n,]
where df is the name of your dataset and Number is the name of the target column. We can fill in n as needed:
#Example
df <- data.frame(Number=c(1,3,24,56.65,56.99,56.14,66),y=sample(LETTERS,7))
df
# Number y
# 1 1.00 J
# 2 3.00 B
# 3 24.00 D
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
# 7 66.00 V
subset2(56)
# Number y
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
I simply changed the $Number column into a numeric field, then rounded down to integer data.
numeric <- as.numeric(as.character(df$Number))
Id <- floor(numeric)
If we only want values of $Number with at least 3 counts, then we can use dplyr to group by $Number and retain it if it has at least 3 counts:
library(dplyr)
# Data
df <- data.frame(Number = c(1,1,1,2,2,3,3))
# Filtering
df %>% group_by(Number) %>% filter(n() >= 3)
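To pull out every repeated point at once, with no value to supply, a sketch combining the two ideas above: group on the floored Number and keep groups with more than one row.
library(dplyr)

df <- data.frame(Number = c(1, 3, 24, 56.65, 56.99, 56.14, 66))
df %>%
  group_by(point = floor(Number)) %>%
  filter(n() > 1) %>%
  ungroup()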
