Selecting columns with only one character in R

Here is my data
df<-read.table(text="A1 A2 AA2 A3 APP3 AA4 A4
17 17 14 18 18 14 17
16 15 13 16 19 15 19
17 14 12 19 15 18 14
17 16 16 18 19 19 20
19 18 12 18 13 17 17
12 19 17 18 16 20 18
20 18 14 13 15 15 16
18 20 12 20 12 12 18
12 15 18 14 16 18 18",h=T)
I want to select columns that have only one A, i.e.,
A1 A2 A3 A4
17 17 18 17
16 15 16 19
17 14 19 14
17 16 18 20
19 18 18 17
12 19 18 18
20 18 13 16
18 20 20 18
12 15 14 18
I have used the following code:
df1<- df%>%
select(contains("A"))
but it gives me every column whose name contains an A.
Is it possible to get table 2? Thanks for your help.

You can use matches() with a regex pattern. A pattern for "contains exactly one 'A'" is "^[^A]*A[^A]*$". Note that APP3 also contains exactly one A, so with the data above it is kept as well:
df %>% select(matches("^[^A]*A[^A]*$"))
# A1 A2 A3 APP3 A4
# 1 17 17 18 18 17
# 2 16 15 16 19 19
# 3 17 14 19 15 14
# 4 17 16 18 19 20
# ...
Based on comments, my best guess for what you want is columns where the name starts with a P and after the P contains only numbers:
# single P followed by numbers
df %>% select(matches("^P[0-9]+$"))
# single A followed by numbers
df %>% select(matches("^A[0-9]+$"))
# single capital letter followed by numbers
df %>% select(matches("^[A-Z][0-9]+$"))
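To see how the patterns above differ, here is a minimal base R sketch that applies two of them directly to the column names from the question. The variable names (col_names, exactly_one_a, a_then_digits) are made up for illustration; grepl() is used in place of matches() only so the check runs without dplyr.

```r
# Column names copied from the question's data frame
col_names <- c("A1", "A2", "AA2", "A3", "APP3", "AA4", "A4")

# "contains exactly one A": also accepts APP3, which has a single A
exactly_one_a <- grepl("^[^A]*A[^A]*$", col_names)

# "a single A followed only by digits": rejects APP3
a_then_digits <- grepl("^A[0-9]+$", col_names)

col_names[exactly_one_a]
col_names[a_then_digits]
```

If the goal is the table shown in the question (A1 A2 A3 A4), the second pattern appears to be the closer match.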

If you're not very comfortable with regular expressions, here's an alternative solution.
The first step is to create a function that counts the occurrences of a character in a vector of strings. I do this by creating a temporary vector of the column names with all the As removed and then subtracting the new number of characters from the original.
count_a <- function(vector, char = "A") {
  # remove every occurrence of char, then compare string lengths
  vec2 <- gsub(char, "", vector, fixed = TRUE)
  numb_As <- nchar(vector) - nchar(vec2)
  return(numb_As)
}
Once you have this function, you simply apply it to the colnames of your dataset and then limit your data to the columns where the count is equal to one.
As <- count_a(colnames(df), "A")
df[, As == 1]

If you are not familiar with regular expressions, you can use a function of the popular package for analysing strings: stringr. With one line you get this:
library(stringr)
df[,str_count(names(df),'A')==1]
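If installing stringr is not an option, roughly the same count can be sketched in base R with gregexpr() and regmatches(). The helper name count_char is hypothetical and is not part of any package.

```r
# Hypothetical helper: count literal occurrences of `char` in each string
count_char <- function(x, char) {
  hits <- gregexpr(char, x, fixed = TRUE)  # match positions per string
  lengths(regmatches(x, hits))             # number of matches per string
}

# Column names copied from the question's data frame
col_names <- c("A1", "A2", "AA2", "A3", "APP3", "AA4", "A4")
counts <- count_char(col_names, "A")

# Keep the names with exactly one "A" (this includes APP3)
col_names[counts == 1]
```

With a data frame, the same logical vector could be used as df[, counts == 1].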

Related

How to filter DataFrame columns by value keeping original column order?

I'm trying to filter a DataFrame, keeping only columns containing "_time" or "___" in their column names.
I tried using df %>% select(contains(c("_time", "___"))). However, this changes the order of the columns in the output: all columns with "_time" are displayed first and the columns with "___" are displayed last.
How can filtering be done without changing the column order?
We can use matches
library(dplyr)
df %>%
select(matches("_time|___"))
-output
h_time l_time f___d m_time s___hello
1 11 16 21 26 31
2 12 17 22 27 32
3 13 18 23 28 33
4 14 19 24 29 34
5 15 20 25 30 35
compared to
df %>%
select(contains(c("_time", "___")))
h_time l_time m_time f___d s___hello
1 11 16 26 21 31
2 12 17 27 22 32
3 13 18 28 23 33
4 14 19 29 24 34
5 15 20 30 25 35
data
df <- data.frame(col1 = 1:5, col2 = 6:10, h_time = 11:15,
l_time = 16:20, f___d = 21:25, m_time = 26:30,
col_new = 41:45, s___hello = 31:35)
Base R: Data from @akrun (many thanks)
df[,grepl("_time|___", colnames(df))]
h_time l_time f___d m_time s___hello
1 11 16 21 26 31
2 12 17 22 27 32
3 13 18 23 28 33
4 14 19 24 29 34
5 15 20 25 30 35
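The difference in ordering can be reproduced on the column names alone. The sketch below is only an illustration in base R: the per-pattern loop mimics the ordering seen in the contains() output above and is not a claim about how tidyselect is implemented. The names nms, one_pass and per_pattern are made up.

```r
# Column names from the example data frame
nms <- c("col1", "col2", "h_time", "l_time", "f___d",
         "m_time", "col_new", "s___hello")

# One combined pattern: names are tested in their original order
one_pass <- nms[grepl("_time|___", nms)]

# One pattern at a time: matches are grouped by pattern
per_pattern <- unlist(lapply(c("_time", "___"), function(p) {
  grep(p, nms, fixed = TRUE, value = TRUE)
}))

one_pass
per_pattern
```

The first result keeps f___d between l_time and m_time, while the second moves it after all the "_time" columns.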

Subtract and find the difference of a value or volume

I have volume measurements of brain parts (optic lobe, olfactory lobe, auditory cortex, etc.); all the parts add up to the total brain volume, as shown in the example data frame here.
a b c d e total
1 2 3 4 5 15
2 3 4 5 6 20
4 6 7 8 9 34
7 8 10 10 15 50
I would like to find the difference in brain volume if I subtract one component from the total volume.
So I was wondering how to go about it in R, without having to create a new column for every brain part.
For example: (total - a = 14, total - b = 13, and so on for the other components).
total-a total-b total-c total-d total-e
14 13 12 11 10
18 17 16 15 14
30 28 27 26 25
43 42 40 40 35
You can do
dat[, "total"] - dat[1:5]
# a b c d e
#1 14 13 12 11 10
#2 18 17 16 15 14
#3 30 28 27 26 25
#4 43 42 40 40 35
If you want also the column names, then one tidyverse possibility could be:
df %>%
gather(var, val, -total) %>%
mutate(var = paste0("total-", var),
val = total - val) %>%
spread(var, val)
total total-a total-b total-c total-d total-e
1 15 14 13 12 11 10
2 20 18 17 16 15 14
3 34 30 28 27 26 25
4 50 43 42 40 40 35
If you do not care about the column names, then with just dplyr you can do:
df %>%
mutate_at(vars(-matches("(total)")), list(~ total - .))
a b c d e total
1 14 13 12 11 10 15
2 18 17 16 15 14 20
3 30 28 27 26 25 34
4 43 42 40 40 35 50
Or without column names with just base R:
df[, grepl("total", names(df))] - df[, !grepl("total", names(df))]
a b c d e
1 14 13 12 11 10
2 18 17 16 15 14
3 30 28 27 26 25
4 43 42 40 40 35
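For a base R result that also carries the "total-" column names, one possible sketch is to subtract and then rename. The data frame is rebuilt here from the question's example; the names dat, parts and diffs are chosen only for illustration.

```r
# Example data rebuilt from the question
dat <- data.frame(a = c(1, 2, 4, 7), b = c(2, 3, 6, 8),
                  c = c(3, 4, 7, 10), d = c(4, 5, 8, 10),
                  e = c(5, 6, 9, 15), total = c(15, 20, 34, 50))

# Every column except the total
parts <- setdiff(names(dat), "total")

# The total vector is subtracted from each component column in turn
diffs <- dat$total - dat[parts]

# Label the result the way the question asks for
names(diffs) <- paste0("total-", parts)
diffs
```

Because the names contain a hyphen, they need backticks or [[ ]] when accessed later, e.g. diffs[["total-a"]].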

Cut function creates too many levels

I have a list of integers that represent years of education:
education= 12 14 17 15 12 19 16 12 16 14 12 18 12 13 18 18 10 13 12 18
22 16 13 22 12 15 12 16 18 18 18 20 18 16 13 12 16 13 18 20 20 20 14 18
18 12 18 16 20 18 14 16 19 12 12 11 13 13
I am trying to categorize the years into 3 different levels:
9-12
13-17
18+
I have tried to use the cut function:
edulevels=cut(education,c(9,12,13,17,18,22))
but it creates 2 additional levels for 12-13 and 17-18:
Levels: (9,12] (12,13] (13,17] (17,18] (18,22]
How do I get it to only create these three levels?
The simplest solution:
edulevels= cut(education,c(9,12.5,17.5,22), labels = c("9-12", "13-17", "18+"))
Intervals defined by the cut() function are closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up.
edulevels <- cut(education,
c(8.5, 12.5, 17.5, Inf),
labels=c('9-12','13-17','18+')
)
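As a quick check of the right-closed behaviour, the sketch below bins a small made-up sample (edu_sample, not the question's full vector) two ways: with breaks between the integers, and with integer breaks that rely on intervals being closed on the right. For integer data the two should agree.

```r
# Small illustrative sample covering the edges of each group
edu_sample <- c(10, 12, 13, 17, 18, 22)
edu_labels <- c("9-12", "13-17", "18+")

# Breaks placed between the integers
between <- cut(edu_sample, breaks = c(8.5, 12.5, 17.5, Inf),
               labels = edu_labels)

# Integer breaks: (8,12], (12,17], (17,Inf] because intervals close on the right
on_integers <- cut(edu_sample, breaks = c(8, 12, 17, Inf),
                   labels = edu_labels)

table(between)
identical(between, on_integers)
```

Using Inf as the last break avoids having to know the maximum value in advance.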

Add data frames row wise with [d]plyr

I have two data frames
df1
# a b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
# a b
# 1 4 8
I want the following output:
df3
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?
I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).
One liner with dplyr.
mutate_each(df1, funs(.+ df2$.), a:b)
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
A base R solution using sweet function sweep:
sweep(df1, 2, unlist(df2), '+')
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
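mutate_each() and funs() are deprecated in current dplyr releases. Assuming dplyr 1.0.0 or later, a similar one-liner can be sketched with across() and cur_column(); the data frames are rebuilt here from the question.

```r
library(dplyr)

# Data frames rebuilt from the question
df1 <- data.frame(a = 10:15, b = 20:25)
df2 <- data.frame(a = 4, b = 8)

# For each column a:b, add the value stored under the same name in df2
df3 <- df1 %>%
  mutate(across(a:b, ~ .x + df2[[cur_column()]]))

df3
```

cur_column() returns the name of the column currently being processed, which is what lets each column pick up its own offset from df2.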

trouble combining two numeric columns into one R

so I am having a bit of bother combining two columns into one. I have two columns of ages, which are split into child and adolescent columns. For example:
child adolescent
1 NA 12
2 NA 15
3 NA 12
4 NA 12
5 NA 13
6 NA 13
7 NA 13
8 NA 14
9 14 15
10 NA 12
11 12 13
12 NA 12
13 NA 13
14 NA 14
15 NA 14
16 12 13
17 NA 14
18 NA 13
19 NA 13
20 NA 14
21 NA 12
22 NA 13
23 12 15
24 NA 13
25 NA 15
26 NA 12
27 NA 15
28 NA 15
29 NA 13
30 NA 12
31 13 15
Now what I would like to do is combine them into one column called "age" and remove all the NA values. However, when I try the following code, I encounter a problem:
age<- c(na.omit(data$child),na.omit(data$adolescent))
The problem is that my original data has 514 rows, yet when I combine the two columns and remove the NAs, I somehow end up with 543 values, not 514, and I don't know why.
So, if possible, could someone explain firstly why I am getting more values than I planned, and secondly what might be a better way to combine the two columns.
EDIT: I am looking for something like this
age
1 12
2 15
3 12
4 12
5 13
6 13
7 13
8 14
9 14
10 12
11 12
12 12
13 13
14 14
15 14
16 12
17 14
18 13
19 13
20 14
21 12
22 13
23 12
24 13
25 15
26 12
27 15
28 15
29 13
30 12
31 13
32 14
33 13
34 11
35 15
36 13
Thanks in advance
This line:
age<- c(na.omit(data$child),na.omit(data$adolescent))
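concatenates all the non-missing values from the child field to all the non-missing values from the adolescent field. Any row that has both a child and an adolescent value therefore contributes two entries, which is why you end up with more values than rows. I suspect you want to use one of these solutions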
# youngest age
age<- pmin(data$child,data$adolescent,na.rm=T)
# oldest age
age<- pmax(data$child,data$adolescent,na.rm=T)
# child age, replaced with adolescent if missing
age<- data$child
age[is.na(age)] <- data$adolescent[is.na(age)]
# ^ notice same logical index ^
# |_______________________________|
Your code works on the example data, but you could try this:
age <- c(data$child, data$adolescent)
age <- age[!is.na(age)]
This combines the two columns from the data frame into a vector and removes all NA elements.
df$age <- ifelse( !(is.na(df$child)), df$child , df$adolescent)
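To make the row-count discrepancy concrete, here is a small sketch on made-up data (four rows, not the asker's 514). It counts the rows where both columns are filled and compares the concatenated length with a row-wise fill; the variable names are chosen for illustration.

```r
# Made-up data: two of the four rows have both ages recorded
data <- data.frame(child = c(NA, 14, NA, 12),
                   adolescent = c(12, 15, 13, 13))

# Concatenating the non-missing values yields one entry per non-NA cell
concatenated <- c(na.omit(data$child), na.omit(data$adolescent))
length(concatenated)

# Rows with both values are the ones counted twice
both_present <- sum(!is.na(data$child) & !is.na(data$adolescent))
both_present

# Row-wise fill: child age if present, otherwise adolescent age
age <- ifelse(is.na(data$child), data$adolescent, data$child)
age
```

Here length(concatenated) exceeds nrow(data) by exactly both_present, whereas the row-wise version always has one value per row.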
