cbind multiple, individual columns in a single data frame using column numbers - r

I have a single data frame of 100 columns and 25 rows. I would like to cbind different groupings of columns (sometimes as many as 30 columns) into several new data frames without having to type out each column name every time.
Some columns that I want fall individually, e.g. 6 and 72, and some lie next to each other, e.g. columns 23, 24, 25, 26 (23:26).
Usually I would use:
z <- cbind(visco$fish, visco$bird)
for example, but I have too many columns and need to create too many new data frames to be typing the name of every column I need every time. Generally I do not attach my data.
I would like to use column numbers, something like:
z <- cbind(6, 72, 23:26, data=visco)
and also retain the original column names, not the automatically generated V1, V2. I have tried adding deparse.level=2, but my column names then become "visco$fish" rather than the original "fish".
I feel there should be a simple answer to this, but so far I have failed to find anything that works as I would like.

df <- data.frame(AA = 11:15, BB = 2:6, CC = 12:16, DD = 3:7, EE = 23:27)
df
# AA BB CC DD EE
# 1 11 2 12 3 23
# 2 12 3 13 4 24
# 3 13 4 14 5 25
# 4 14 5 15 6 26
# 5 15 6 16 7 27
df1 <- data.frame(cbind(df,df,df,df))
df1
# AA BB CC DD EE AA.1 BB.1 CC.1 DD.1 EE.1 AA.2 BB.2 CC.2 DD.2 EE.2 AA.3 BB.3
# 1 11 2 12 3 23 11 2 12 3 23 11 2 12 3 23 11 2
# 2 12 3 13 4 24 12 3 13 4 24 12 3 13 4 24 12 3
# 3 13 4 14 5 25 13 4 14 5 25 13 4 14 5 25 13 4
# 4 14 5 15 6 26 14 5 15 6 26 14 5 15 6 26 14 5
# 5 15 6 16 7 27 15 6 16 7 27 15 6 16 7 27 15 6
# CC.3 DD.3 EE.3
# 1 12 3 23
# 2 13 4 24
# 3 14 5 25
# 4 15 6 26
# 5 16 7 27
Result <- data.frame(cbind(df1[,c(1:5,14:17,20)]))
Result
# AA BB CC DD EE DD.2 EE.2 AA.3 BB.3 EE.3
# 1 11 2 12 3 23 3 23 11 2 23
# 2 12 3 13 4 24 4 24 12 3 24
# 3 13 4 14 5 25 5 25 13 4 25
# 4 14 5 15 6 26 6 26 14 5 26
# 5 15 6 16 7 27 7 27 15 6 27
Note: R itself disambiguates repeated column names, suffixing later appearances with .1, .2, and so on.
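This suffixing comes from make.unique (which data.frame applies to its names when check.names = TRUE); a quick illustration:
make.unique(c("AA", "AA", "AA"))
# [1] "AA"   "AA.1" "AA.2"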

Here's an example of how to do this using the select function from dplyr, which should be your go-to package for this type of data wrangling:
> library(dplyr)
> df <- head(iris)
> df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
>
>## select by variable name
>newdf <- df %>% select(Sepal.Length, Sepal.Width,Species)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
>## select by variable indices
> newdf <- df %>% select(1:2,5)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
However, I'm not sure why you would need to do this? Can you not run your analyses on the original dataframe?

I understand your question as subsetting a large data frame into smaller ones, which can be achieved in different ways. One way is the data.table package, which lets you retain the column names while subsetting by column indices.
If you have your data as a data frame, you can just do
DT<- data.table(df)
# You still have to define your subsets of columns you need to create
sub_1<-c(2,3)
sub_2<-c(2:5,9)
sub_3<-c(1:2,5:6,10)
DT[ ,sub_2, with = FALSE]
Output
bird cat dog rat car
1: 0.2682538 0.1386834 0.01633384 0.5336649 0.43432878
2: 0.2418727 0.7530654 0.26999873 0.2679446 0.00859734
3: 0.1211858 0.2563736 0.92637523 0.8572615 0.63165705
4: 0.4556401 0.2343427 0.09324584 0.8731174 0.50098461
5: 0.1646126 0.9258622 0.86957980 0.3636781 0.89608415
Data
require("data.table")
DT <- data.table(matrix(runif(5*10), 5, 10))
colnames(DT) <- c("fish","bird","cat","dog","rat","tiger","insect","boat","car", "cycle")

Try this:
z <- visco[c(6,72,23:26)]
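Single-bracket indexing of a data frame returns a data frame and keeps the original column names, so nothing gets renamed to V1, V2. A quick check with a small example:
df <- data.frame(AA = 11:15, BB = 2:6, CC = 12:16, DD = 3:7, EE = 23:27)
names(df[c(1, 3:5)])
# [1] "AA" "CC" "DD" "EE"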

In R we have vectors and matrices. You can create your own vectors with the function c.
c(1,5,3,4)
They are also the output of many functions such as
rnorm(10)
You can turn vectors into matrices using functions such as rbind, cbind or matrix.
Create the matrix from the vector 1:1000 like this:
X = matrix(1:1000,100,10)
What is the entry in row 25, column 3 ?
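Since matrix fills column-wise by default, column 3 holds 201:300, so:
X[25, 3]
# [1] 225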

Related

Add a column that iterates/counts every time a sequence resets

I have a dataframe with a column that increases with every row and periodically (though not regularly) resets back to 1.
I'd like to track/count these resets in a separate column. This for-loop example does exactly what I want, but is incredibly slow when applied to large datasets. Is there a better/quicker/more R way to do this same operation?
ColA <- seq(1, 20)
ColB <- rep(seq(1, 5), 4)
DF <- data.frame(ColA, ColB)
DF$ColC <- NA
DF[1, 'ColC'] <- 1
# Removing row 15 and changing row 5 to 0.1 per comments in answer
DF <- DF[-15, ]
DF[5, 2] <- 0.1
for (i in seq(1, nrow(DF) - 1)) {
  print(i)
  MyRow <- DF[i + 1, ]
  if (MyRow$ColB < DF[i, 'ColB']) {
    DF[i + 1, "ColC"] <- DF[i, "ColC"] + 1
  } else {
    DF[i + 1, "ColC"] <- DF[i, "ColC"]
  }
}
No need for a loop here. We can just use the vectorized cumsum. This ought to be faster:
DF$ColC<-cumsum(DF$ColB==1)
DF
To handle varying reset values, where each reset drops below the previous value, use cumsum(ColB < lag(ColB)) (lag here being dplyr's):
DF %>% mutate(ColC = cumsum(ColB < lag(ColB, default = Inf)))
ColA ColB ColC
1 1 1.0 1
2 2 2.0 1
3 3 3.0 1
4 4 4.0 1
5 5 0.1 2
6 6 1.0 2
7 7 2.0 2
8 8 3.0 2
9 9 4.0 2
10 10 5.0 2
11 11 1.0 3
12 12 2.0 3
13 13 3.0 3
14 14 4.0 3
16 16 1.0 4
17 17 2.0 4
18 18 3.0 4
19 19 4.0 4
20 20 5.0 4
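For completeness, a base-R sketch of the same idea with no packages, treating any decrease in ColB as a reset:
DF$ColC <- cumsum(c(TRUE, diff(DF$ColB) < 0))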

Select varying number of top_n for different groups using dplyr

I have the following dataframe, and I would prefer a dplyr solution to this problem.
For each zone I want at minimum two values; values > 4.0 are preferred.
Therefore, for zone 10 all values (being > 4.0) are kept. For zone 20, the top two values are picked. Similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n, but it picks the same number of rows for each zone.
You could dynamically calculate n in top_n: sum(value > 4) counts the preferred rows in each zone, and max(..., 2) enforces the minimum of two.
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
Another way is to rank within each zone using row_number: ranking by descending value, keep the top two rows plus anything above 4.
library(tidyverse)
df %>%
  group_by(zone) %>%
  filter(row_number(-value) <= 2 | value > 4)

How to replicate observations as a function of the values of another variable

I want to replicate observations based on the values of the variable iptw, to create pseudo-populations for further analysis.
For example, if iptw = 4.5, then a weight of 5 should be created and the observation/row multiplied by 5. Likewise, if iptw = 2.3, then the weight is 2 and that row is multiplied by 2, which is equivalent to adding the corresponding observation twice to the data frame.
Here is my dataset:
library(data.table)
dtNEW <- data.table(id = 1:4, x1 = 10:13, x2 = 21:24, iptw = c(2.3, 0.6, 4.5, 0.1))
There is a similar question here but the solutions there do not answer my question.
Assuming you want to replicate the ith row round(iptw[i]) times:
dtNEW[rep(1:.N, round(iptw)), ]
giving:
id x1 x2 iptw
1: 1 10 21 2.3
2: 1 10 21 2.3
3: 2 11 22 0.6
4: 3 12 23 4.5
5: 3 12 23 4.5
6: 3 12 23 4.5
7: 3 12 23 4.5
Another option is uncount from tidyr
library(tidyr)
uncount(dtNEW, round(iptw))
# id x1 x2 iptw
#1: 1 10 21 2.3
#2: 1 10 21 2.3
#3: 2 11 22 0.6
#4: 3 12 23 4.5
#5: 3 12 23 4.5
#6: 3 12 23 4.5
#7: 3 12 23 4.5
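Note that both answers use round(), and R rounds half to even (IEC 60559), so round(4.5) is 4; that is why id 3 appears four times above rather than the five the question asks for. If 4.5 really should yield 5 copies, a round-half-up sketch:
round_half_up <- function(x) floor(x + 0.5)  # 4.5 -> 5, 2.3 -> 2, 0.6 -> 1, 0.1 -> 0
dtNEW[rep(1:.N, round_half_up(iptw)), ]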

Grouped ranking in R

I have data with a primary key and ratio values like the following:
2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869
I want to add a rank column which groups these ratios into, say, 3 bins; functionality similar to the SAS code:
PROC RANK DATA = TAB1 GROUPS = &NUM_BINS
I did the following:
Convert your vector to a data frame.
Create the variable rank:
test2$rank <- rank(test2$test)
> test2
test rank
1 2.243164 15.0
2 1.429242 9.0
3 2.119271 11.0
4 3.013427 16.0
5 1.208635 3.5
6 1.208635 3.5
7 1.236576 7.0
8 2.212136 14.0
9 2.168583 13.0
10 2.151961 12.0
11 1.159886 2.0
12 1.234106 6.0
13 1.694206 10.0
14 1.401425 8.0
15 5.210126 17.0
16 1.215268 5.0
17 1.089190 1.0
Define a function to convert to percentile ranks, and then define pr as that percentile:
percent.rank <- function(x) trunc(rank(x)/length(x)*100)
test3 <- within(test2, pr <- percent.rank(rank))
Then I created bins, based on the fact that you wanted 3 of them:
test3$bins <- cut(test3$pr, breaks = c(0,33,66,100), labels = c("0-33","34-66","66-100"))
       test rank  pr   bins
1  2.243164 15.0  88 66-100
2  1.429242  9.0  52  34-66
3  2.119271 11.0  64  34-66
4  3.013427 16.0  94 66-100
5  1.208635  3.5  20   0-33
6  1.208635  3.5  20   0-33
7  1.236576  7.0  41  34-66
8  2.212136 14.0  82 66-100
9  2.168583 13.0  76 66-100
10 2.151961 12.0  70 66-100
11 1.159886  2.0  11   0-33
12 1.234106  6.0  35  34-66
13 1.694206 10.0  58  34-66
14 1.401425  8.0  47  34-66
15 5.210126 17.0 100 66-100
16 1.215268  5.0  29   0-33
17 1.089190  1.0   5   0-33
That work for you?
A bit late, but given your data, we can use ntile from the dplyr package to get equal-sized groups:
df <- data.frame(values = c(2.243164164,
1.429242413,
2.119270714,
3.013427143,
1.208634972,
1.208634972,
1.23657632,
2.212136028,
2.168583297,
2.151961216,
1.159886063,
1.234106444,
1.694206176,
1.401425329,
5.210125578,
1.215267806,
1.089189869))
library(dplyr)
df <- df %>%
  arrange(values) %>%
  mutate(rank = ntile(values, 3))
values rank
1 1.089190 1
2 1.159886 1
3 1.208635 1
4 1.208635 1
5 1.215268 1
6 1.234106 1
7 1.236576 2
8 1.401425 2
9 1.429242 2
10 1.694206 2
11 2.119271 2
12 2.151961 2
13 2.168583 3
14 2.212136 3
15 2.243164 3
16 3.013427 3
17 5.210126 3
Or see cut_number from the ggplot2 package:
library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))
values rank rank2
1 1.089190 1 1
2 1.159886 1 1
3 1.208635 1 1
4 1.208635 1 1
5 1.215268 1 1
6 1.234106 1 1
7 1.236576 2 2
8 1.401425 2 2
9 1.429242 2 2
10 1.694206 2 2
11 2.119271 2 2
12 2.151961 2 3
13 2.168583 3 3
14 2.212136 3 3
15 2.243164 3 3
16 3.013427 3 3
17 5.210126 3 3
Because your sample consists of 17 numbers, one bin gets 5 numbers while the others get 6. The two methods differ at row 12: ntile assigns 6 numbers to each of the first and second groups, whereas cut_number assigns 6 to the first and third groups.
> table(df$rank)
1 2 3
6 6 5
> table(df$rank2)
1 2 3
6 5 6
See also here: Splitting a continuous variable into equal sized groups

Creating a new column in a data frame whose entries depend on multiple columns in another data frame

I want to make a new column in my data set whose values are determined by values in another data set, but it's not as simple as the values in one column being a function of the values in the other. Here's an example:
>df1
chromosome position
1 1 1
2 1 2
3 1 4
4 1 5
5 1 7
6 1 12
7 1 13
8 1 15
9 1 21
10 1 23
11 1 24
12 2 1
13 2 5
14 2 7
15 2 8
16 2 12
17 2 15
18 2 18
19 2 21
20 2 22
and
>df2
chromosome segment_start segment_end segment.number
1 1 1 5 1.1
2 1 6 20 1.2
3 1 21 25 1.3
4 2 1 7 2.1
5 2 8 16 2.2
6 2 18 22 2.3
I want to make a new column in df1 called 'segment', and the value in segment is to be determined by which segment (as determined by 'segment_start', 'segment_end', and 'chromosome' from df2) the value in 'position' belongs to. For example, in df1, row 7, position=13, and chromosome=1. Because 13 is between 6 and 20, the entry in my hypothetical 'segment' column would be 1.2, from row 2 of df2, because 13 falls between segment_start and segment_end from that row (6 and 20, respectively), and the 'chromosome' value from df1 row 7 is 1, just as 'chromosome' in df2 row 2 is 1.
Each row in df1 belongs to one of the segments described in df2; that is, it lies on the same chromosome as one of the segments, and its 'position' is >=segment_start and <=segment_end. And I want to get that information into df1, so it says what segment each position belongs to.
I was thinking of using an if statement, and started with:
if(df1$position >= df2$segment_start & df1$position <= df2$segment_end & df1$chromosome == df2$chromosome) df1$segment <- df2$segment.number
But I am not sure that approach is feasible. If nothing else, maybe the code can help illustrate what it is I'm trying to do. Basically, I want to match each row, by its position and chromosome, to a segment in df2. Thanks.
This appears to be a rolling join. You can use data.table for this:
require(data.table)
DT1 <- data.table(df1, key = c('chromosome', 'position'))
DT2 <- data.table(df2, key = c('chromosome', 'segment_start'))
# DT2[DT1, roll = TRUE] performs the join you want, but it retains all of
# DT2's column names, which is why the result is renamed and subset here:
DT2[DT1, roll = TRUE][, list(chromosome, position = segment_start, segment.number)]
# chromosome position segment.number
# 1: 1 1 1.1
# 2: 1 2 1.1
# 3: 1 4 1.1
# 4: 1 5 1.1
# 5: 1 7 1.2
# 6: 1 12 1.2
# 7: 1 13 1.2
# 8: 1 15 1.2
# 9: 1 21 1.3
# 10: 1 23 1.3
# 11: 1 24 1.3
# 12: 2 1 2.1
# 13: 2 5 2.1
# 14: 2 7 2.1
# 15: 2 8 2.2
# 16: 2 12 2.2
# 17: 2 15 2.2
# 18: 2 18 2.3
# 19: 2 21 2.3
# 20: 2 22 2.3
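An alternative sketch uses data.table's foverlaps (an overlap join, available in reasonably recent data.table versions), treating each position as the degenerate interval [position, position]:
require(data.table)
DT1 <- data.table(df1)[, `:=`(start = position, end = position)]
DT2 <- data.table(df2)
setkey(DT2, chromosome, segment_start, segment_end)
foverlaps(DT1, DT2, by.x = c("chromosome", "start", "end"))[
  , .(chromosome, position, segment.number)]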
You really need to check out the GenomicRanges package from Bioconductor. It provides the data structures that are appropriate for your use case.
First, we create the GRanges objects:
library(GenomicRanges)
gr1 <- with(df1, GRanges(chromosome, IRanges(position, width=1L)))
gr2 <- with(df2, GRanges(chromosome, IRanges(segment_start, segment_end),
                         segment.number=segment.number))
Then we find the overlaps and do the merge:
hits <- findOverlaps(gr1, gr2)
gr1$segment[queryHits(hits)] <- gr2$segment.number[subjectHits(hits)]
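To carry the assignments back onto the original data frame (assuming, as here, that every position overlaps exactly one segment):
df1$segment <- gr1$segment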
I'm going to assume that the regions in df2 are non-overlapping, continuous and complete (not missing any positions from df1). I seem to do this differently every time I try, so here's my latest idea.
First, make sure chromosome is a factor in both data sets
df1$chromosome<-factor(df1$chromosome)
df2$chromosome<-factor(df2$chromosome)
Now I want to unwrap chr/pos into one overall generic position; I'll do that with
ends<-with(df2, tapply(segment_end, chromosome, max))
offset<-head(c(0,cumsum(ends)),-1)
names(offset)<-names(ends)
This assigns unique position values to all positions across all chromosomes, and it tracks the offset to the beginning of each chromosome in this new system. Now we build a translation function from the data in df2:
seglookup <- approxfun(with(df2, offset[chromosome] + segment_start), 1:nrow(df2),
                       method = "constant", rule = 2)
We use approxfun to build a step function that maps any point in this generic position space to the segment interval containing it. Now we use this function on df1:
segid <- with(df1, seglookup(offset[chromosome]+position))
Now we have the correct segment ID for each position. We can verify this by merging the data and looking at the result:
cbind(df1, df2[segid,-1])
chromosome position segment_start segment_end segment.number
1 1 1 1 5 1.1
2 1 2 1 5 1.1
3 1 4 1 5 1.1
4 1 5 1 5 1.1
5 1 7 6 20 1.2
6 1 12 6 20 1.2
7 1 13 6 20 1.2
8 1 15 6 20 1.2
9 1 21 21 25 1.3
10 1 23 21 25 1.3
11 1 24 21 25 1.3
12 2 1 1 7 2.1
13 2 5 1 7 2.1
14 2 7 1 7 2.1
15 2 8 8 16 2.2
16 2 12 8 16 2.2
17 2 15 8 16 2.2
18 2 18 18 22 2.3
19 2 21 18 22 2.3
20 2 22 18 22 2.3
So it looks like we did alright.
