Subset data frame based on column values - r

I have a data frame consisting of the fluorescence read out of multiple cells tracked over time, for example:
Number=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Fluorescence=c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df = data.frame(Number, Fluorescence)
which gives:
Number Fluorescence
1 1 9
2 2 10
3 3 20
4 4 30
5 1 8
6 2 11
7 3 21
8 4 31
9 1 6
10 2 12
11 3 22
12 4 32
13 1 7
14 2 13
15 3 23
16 4 33
Number pertains to the cell number. What I want is to collate the fluorescence readout based on the cell number. The data.frame here has it counting 1-4, whereas really I want something like this:
Number Fluorescence
1 1 9
2 1 8
3 1 6
4 1 7
5 2 10
6 2 11
7 2 12
8 2 13
9 3 20
10 3 21
11 3 22
12 3 23
13 4 30
14 4 31
15 4 32
16 4 33
Or even more ideal would be having columns based on Number, then respective cell fluorescence:
1 2 3 4
1 9 10 20 30
2 8 11 21 31
3 6 12 22 32
4 7 13 23 33
I've used the which function to extract them one at a time:
Cell1=df[which(df[,1]==1),2]
But this would require me to write a line for each cell (of which there are hundreds).
Thank you for any help with this! Apologies that I'm still a bit of an R noob.

How about this:
library(tidyr);library(data.table)
number <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
fl <- c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df <- data.table(number,fl)
df[, index:=1:.N, keyby=number]
df
number fl index
1: 1 9 1
2: 1 8 2
3: 1 6 3
4: 1 7 4
5: 2 10 1
6: 2 11 2
7: 2 12 3
8: 2 13 4
9: 3 20 1
10: 3 21 2
11: 3 22 3
12: 3 23 4
13: 4 30 1
14: 4 31 2
15: 4 32 3
16: 4 33 4
The index column is added so that the spread function from tidyr has a unique identifier for each output row. See this post for more information.
spread(df,number,fl)
index 1 2 3 4
1: 1 9 10 20 30
2: 2 8 11 21 31
3: 3 6 12 22 32
4: 4 7 13 23 33
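If you would rather avoid extra packages, the same two results can be sketched in base R. This is only an illustration using the data frame from the question: `order()` gives the long, sorted layout, and `split()` plus `cbind()` gives one column per cell. It assumes every cell has the same number of readings. (In current tidyr releases `spread()` is superseded by `pivot_wider()`.)

```r
# Data from the question
Number <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Fluorescence <- c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df <- data.frame(Number, Fluorescence)

# Long layout: rows sorted by cell number; ties keep their original (time) order
long <- df[order(df$Number), ]

# Wide layout: one column per cell number
# (assumes each cell has the same number of readings)
wide <- do.call(cbind, split(df$Fluorescence, df$Number))
wide
```

Here `wide` is a matrix whose column names are the cell numbers, so `wide[, "3"]` pulls out the readings for cell 3 without writing one line per cell.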

Related

Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE) # keep the graph object; which_mutual() needs it below
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We can put the two endpoints of each row in a fixed order with pmin and pmax, then use duplicated to build a logical vector that flags the second occurrence of each pair:
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
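To see why this works, here is a small self-contained sketch on a hand-made edgelist (the values are made up for illustration, not taken from the random graph above). `pmin()` and `pmax()` turn both 1 -> 4 and 4 -> 1 into the key (1, 4), so `duplicated()` marks the second row of each mutual pair.

```r
# Hypothetical edgelist that contains only mutual dyads
ff1 <- data.frame(from = c(1, 4, 2, 3, 6, 1),
                  to   = c(4, 1, 3, 2, 1, 6))

# Order-independent key: smaller endpoint first, larger endpoint second
key <- cbind(pmin(ff1$from, ff1$to), pmax(ff1$from, ff1$to))

# Keep only the first row seen for each key
dedup <- ff1[!duplicated(key), ]
dedup
```

The first row of each dyad is the one that survives, so the result depends on the row order of the input.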

R aggregation of columns by met condition in one column

I am trying to aggregate or associate 2 columns in a 4-column matrix. The matrix is filled with numeric values. I would like to show only column1 and column3 for the rows where column1 is > .25. I have tried numerous R commands but can't get the 2 columns to show when the criterion is met in column 1.
For example
1.094262, 14
0.5962845, 17
Below is the dataset. Example of desired output above.
0.1287953 3 12 1
1.094262 13 14 3
0.5962845 8 17 4
0.6511204 7 19 5
0.2533915 4 6 2
0.8222555 6 18 6
0.08695875 3 7 1
0.6096232 6 6 2
1.583204 24 7 1
0.08337463 4 7 1
0.06398186 1 11 2
0.2713974 4 11 2
0.6205648 13 4 1
1.276595 15 14 3
Is this what you are looking for?
df[df$V1>0.25,c(1,3)]
V1 V3
2 1.0942620 14
3 0.5962845 17
4 0.6511204 19
5 0.2533915 6
6 0.8222555 18
8 0.6096232 6
9 1.5832040 7
12 0.2713974 11
13 0.6205648 4
14 1.2765950 14
where df is:
df=read.table(text="0.1287953 3 12 1
1.094262 13 14 3
0.5962845 8 17 4
0.6511204 7 19 5
0.2533915 4 6 2
0.8222555 6 18 6
0.08695875 3 7 1
0.6096232 6 6 2
1.583204 24 7 1
0.08337463 4 7 1
0.06398186 1 11 2
0.2713974 4 11 2
0.6205648 13 4 1
1.276595 15 14 3", header=FALSE)
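Since the question mentions a matrix rather than a data frame, here is a hedged sketch of the same idea with matrix indexing. The object name `m` is an assumption, and only the first five rows of the data are used to keep it short.

```r
# First five rows of the data from the question, stored as a numeric matrix
m <- matrix(c(0.1287953,  3, 12, 1,
              1.094262,  13, 14, 3,
              0.5962845,  8, 17, 4,
              0.6511204,  7, 19, 5,
              0.2533915,  4,  6, 2),
            ncol = 4, byrow = TRUE)

# Logical row filter on column 1, then select columns 1 and 3
keep <- m[, 1] > 0.25
res <- m[keep, c(1, 3), drop = FALSE]  # drop = FALSE keeps a matrix even for one row
res
```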

R which argument fits well to obtain nonuniform bins using "plot" to build an informative histogram

I am new to R and am trying to plot a cumulative frequency histogram with non-uniform bins for a large amount of data (a few million positive numbers with a minimum value of 1 and a maximum that varies from data set to data set, for instance 1*10^6 or 1*10^5). I used the simple code below to generate a histogram of the data.
Sample data, for example:
[89601] 10 2 2 4 3 12 3 25 25 2
[89611] 5 5 5 2 23 22 14 8 13 10
[89621] 13 19 157 2 3 2 4 2 3 33
[89631] 22 2 14 9 2 3 3 3 8 2
[89641] 8 3 2 127 8 2 18 2 4 2
[89651] 2 13 3 34 8 2 6 10 3 7
[89661] 3 9 7 3 36 9 5 2 10 15
[89671] 7 2 23 2 2 2 2 7 6 25
[89681] 3 3 2 6 37 49 28 11 3 35
[89691] 2 2 8 3 3 2 2 4 3 12
[89701] 3 5 2 7 3 2 15 6 3 14
[89711] 13 5 3 2 2 8 34 4 4 65
[89721] 5 9 12 2 11 2 2 79 9 13
[89731] 2 66 2 9 10 22 11 2 6 3
[89741] 12 2 11 5 4 4 2 4 3 4
[89751] 2 8 9 3 2 2 84 7 11 10
[89761] 8 30 16 3 63 2 2 24 13 2
[89771] 11 37 2 9 21 21 10 2 2 49
[89781] 3 3 8 5 2 19 9 6 5 4
[89791] 4 2 9 2 10 33 5 4 2 2
[89801] 4 2 2 4 9 3 11 2 5 142
[89811] 17 2 11 4 2 8 26 2 9 8
[89821] 10 2 4 2 5 2 20 7 145 11
[89831] 22 19 8 14 18 39 3 2 3 3
[89841] 2 11 10 3 2 3 3 5 6 12
[89851] 17 5 3 8 2 2 2 2 2 5
[89861] 4 2 13 3 2 2 2 2 3 2
[89871] 4 3 21 2 6 2 8 9 7 14
[89881] 2 582 3 15 11 3 20 16 9 8
[89891] 6 2 6 7 3 20 17 2 9 5
[89901] 5 11 2 12 7 2 46 2 144 9
[89911] 2 3 36 25 3 2 16 2 2 119
[89921] 5 5 10 6 2 2 6 84 13 2
[89931] 2 6 6 2 17 3 7 4 102 48
data <- read.table("sample.txt", header=FALSE)
data <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data)
When I did this I got a histogram with all the data (positive numbers) on the x-axis and counts on the y-axis. Then I changed the x limits to cover only the area of interest:
plot(data, xlim=c(0,200000))
As before, a histogram is plotted, but using "plot" I couldn't define the number of bins, so the histogram is neither clear (it does not show the bars I want) nor informative.
As I am new to this forum, I have no idea how to upload images, so I couldn't provide the histogram.
Any suggestions would be very helpful.
To plot a histogram you can call the hist() function like this:
hist(data$V1, xlim=c(0,200000), breaks=100)
The breaks parameter sets roughly how many bars will be plotted (a single number is treated as a suggestion). But this number applies to the whole range of the data, not to the xlim you specified. So hist() first builds a histogram with the given number of breaks and only then cuts out the part of the plot you asked for.
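A small sketch of that effect, using made-up data (mostly small values plus two large outliers). Binning everything gives very wide bins, while subsetting to the range of interest before calling hist() gives bins that resolve it. `plot = FALSE` is used only so the break points can be inspected.

```r
# Hypothetical data: values 1..50 plus two large outliers
x <- c(1:50, 1000, 5000)

# Bins are computed over the full range of x, whatever xlim is used later
h_all <- hist(x, breaks = 10, plot = FALSE)

# Subset first, so the requested number of bins covers only the range of interest
h_sub <- hist(x[x <= 50], breaks = 10, plot = FALSE)

# Compare the width of the first bin in each case
diff(h_all$breaks)[1]
diff(h_sub$breaks)[1]
```

For the data in the question, the equivalent would be something like hist(data$V1[data$V1 <= 200000], breaks=100).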
But there is another way to plot the bars:
data <- read.table("sample.txt", header=FALSE)
data.hist <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data.hist$counts, type='h')
The hist function returns an object that describes the histogram (break points, counts and so on).
I assume you are interested in its "counts" field.
You can plot this in a histogram-like way by setting type='h'.
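Because the question asks for non-uniform bins and cumulative frequencies, here is a minimal sketch of how the returned object could be used for that. The break points are an assumption chosen for illustration; the values are the first three rows of the sample data in the question.

```r
# First 30 values of the sample data from the question
x <- c(10, 2, 2, 4, 3, 12, 3, 25, 25, 2,
       5, 5, 5, 2, 23, 22, 14, 8, 13, 10,
       13, 19, 157, 2, 3, 2, 4, 2, 3, 33)

# Non-uniform bin edges (hypothetical); they must cover the full range of x
brk <- c(0, 2, 5, 10, 25, 50, 200)

# plot = FALSE returns the histogram object without drawing anything
h <- hist(x, breaks = brk, plot = FALSE)
h$counts          # counts per bin: 7 9 3 9 1 1

# Cumulative frequency per bin
cum <- cumsum(h$counts)
cum               # 7 16 19 28 29 30
```

Calling barplot(cum, names.arg = brk[-1]) would then draw the cumulative counts as bars labelled by each bin's upper edge.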

divide dataframe into subgroups based on several columns successively in R

I have to split a data pool with the following structure into subgroups based on the values of 3 columns in R, but I cannot figure out how.
What I want to do is:
First, sort the data pool by column V1 in descending order and divide it into three subgroups according to the value of V1.
Then divide each of the 3 subgroups into another 3 subgroups according to the value of V2, which gives 9 subgroups.
Similarly, subdivide each of the 9 groups into 3 groups again, resulting in 27 subgroups altogether.
The following data is only a simple example; the real data has 1545 firms.
Firm value V1 V2 V3
1 7 7 11 8
2 9 9 11 7
3 8 14 8 10
4 9 9 7 14
5 8 11 15 14
6 9 10 9 7
7 8 8 6 14
8 4 8 11 14
9 8 10 13 10
10 2 11 6 13
11 3 5 12 14
12 5 12 15 12
13 1 9 13 7
14 4 5 14 7
15 5 10 5 9
16 5 8 13 14
17 2 10 10 7
18 5 12 12 9
19 7 6 11 7
20 6 9 14 14
21 6 14 9 14
22 8 6 6 7
23 9 11 9 5
24 7 7 6 9
25 10 5 15 11
26 4 6 10 9
27 4 13 14 8
And the result should be:
Firm value V1 V2 V3
5 8 11 15 14
12 5 12 15 12
27 4 13 14 8
21 6 14 9 14
18 5 12 12 9
23 9 11 9 5
10 2 11 6 13
3 8 14 8 10
6 9 10 9 7
20 6 9 14 14
9 8 10 13 10
13 1 9 13 7
8 4 8 11 14
2 9 9 11 7
17 2 10 10 7
4 9 9 7 14
7 8 8 6 14
15 5 10 5 9
16 5 8 13 14
25 10 5 15 11
14 4 5 14 7
11 3 5 12 14
1 7 7 11 8
19 7 6 11 7
26 4 6 10 9
24 7 7 6 9
22 8 6 6 7
I have tried for a long time, also searched Google without success. :(
As @Codoremifa said, data.table can be used here:
require(data.table)
DT <- data.table(dat)
DT[order(V1),G1:=rep(1:3,each=9)]
DT[order(V2),G2:=rep(1:3,each=3),by=G1]
DT[order(V3),G3:=1:3,by='G1,G2']
Now your groups are labeled using the additional columns G1, G2 and G3. To sort, so that it's easier to see the groups, use
setkey(DT,G1,G2,G3)
A couple of the OP's columns are just noise unrelated to the question; to verify that this works by eye, try DT[,list(V1,V2,V3,G1,G2,G3)]
EDIT: The OP did not specify a means of dealing with ties. I guess it makes sense to use the value in the later columns to break ties, so...
DT <- data.table(dat)
DT[order(rank(V1)+rank(V2)/100+rank(V3)/100^2),
G1:=rep(1:3,each=9)]
DT[order(rank(V2)+rank(V3)/100),
G2:=rep(1:3,each=3),by=G1]
DT[order(V3),
G3:=1:3,by='G1,G2']
setkey(DT,G1,G2,G3)
DT[27:1] (the result backwards) is
Firm value V1 V2 V3 G1 G2 G3
1: 5 8 11 15 14 3 3 3
2: 12 5 12 15 12 3 3 2
3: 27 4 13 14 8 3 3 1
4: 21 6 14 9 14 3 2 3
5: 9 8 10 13 10 3 2 2
6: 18 5 12 12 9 3 2 1
7: 10 2 11 6 13 3 1 3
8: 3 8 14 8 10 3 1 2
9: 23 9 11 9 5 3 1 1
10: 20 6 9 14 14 2 3 3
11: 16 5 8 13 14 2 3 2
12: 13 1 9 13 7 2 3 1
13: 8 4 8 11 14 2 2 3
14: 17 2 10 10 7 2 2 2
15: 2 9 9 11 7 2 2 1
16: 4 9 9 7 14 2 1 3
17: 15 5 10 5 9 2 1 2
18: 6 9 10 9 7 2 1 1
19: 11 3 5 12 14 1 3 3
20: 25 10 5 15 11 1 3 2
21: 14 4 5 14 7 1 3 1
22: 26 4 6 10 9 1 2 3
23: 1 7 7 11 8 1 2 2
24: 19 7 6 11 7 1 2 1
25: 7 8 8 6 14 1 1 3
26: 24 7 7 6 9 1 1 2
27: 22 8 6 6 7 1 1 1
Firm value V1 V2 V3 G1 G2 G3
Here is an answer using transform and then ddply from plyr. I don't address the ties, which really means that in case of a tie the value from the lowest row number is used first. This is what the OP shows in the example output.
First, order the dataset in descending order of V1 and create three groups of 9 by creating a new variable, fv1.
dat1 = transform(dat1[order(-dat1$V1),], fv1 = factor(rep(1:3, each = 9)))
Then order the dataset in descending order of V2 and create three groups of 3 within each level of fv1.
require(plyr)
dat1 = ddply(dat1[order(-dat1$V2),], .(fv1), transform, fv2 = factor(rep(1:3, each = 3)))
Finally order the dataset by the two factors and V3. I use arrange from plyr for typing efficiency compared to order
(finaldat = arrange(dat1, fv1, fv2, -V3) )
This isn't a particularly generalizable answer, as the group sizes for the factors have to be known in advance. If the V3 group size were larger than one, a process similar to the one for V2 would be needed.
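For group sizes that are not known in advance, one possible base R sketch is to derive the group label from the rank instead of hard-coding `rep()`. The helper `ntile_group()` is a hypothetical name introduced here, not part of any package. Ties are broken by row order, and labels run from 1 (lowest values) to 3 (highest), as in the data.table answer.

```r
# V1..V3 from the question's 27-firm example
dat <- data.frame(
  Firm = 1:27,
  V1 = c(7, 9, 14, 9, 11, 10, 8, 8, 10, 11, 5, 12, 9, 5,
         10, 8, 10, 12, 6, 9, 14, 6, 11, 7, 5, 6, 13),
  V2 = c(11, 11, 8, 7, 15, 9, 6, 11, 13, 6, 12, 15, 13, 14,
         5, 13, 10, 12, 11, 14, 9, 6, 9, 6, 15, 10, 14),
  V3 = c(8, 7, 10, 14, 14, 7, 14, 14, 10, 13, 14, 12, 7, 7,
         9, 14, 7, 9, 7, 14, 14, 7, 5, 9, 11, 9, 8)
)

# Hypothetical helper: split x into k roughly equal groups by rank
ntile_group <- function(x, k = 3) {
  ceiling(rank(x, ties.method = "first") * k / length(x))
}

dat$G1 <- ntile_group(dat$V1)                             # 3 groups on V1
dat$G2 <- ave(dat$V2, dat$G1, FUN = ntile_group)          # 3 groups on V2 within G1
dat$G3 <- ave(dat$V3, dat$G1, dat$G2, FUN = ntile_group)  # 3 groups on V3 within G1, G2

# Sort from highest to lowest group to view the result
res <- dat[order(-dat$G1, -dat$G2, -dat$G3), ]
```

With 1545 firms the groups cannot all be the same size, so this sketch lets the sizes differ by one rather than failing.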

How to create a dataframe with different number of values?

When I create a dataframe I do:
dt = data.frame(a=c(1:5),b=c(1:20))
dt
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 1 6
7 2 7
8 3 8
9 4 9
10 5 10
11 1 11
12 2 12
13 3 13
14 4 14
15 5 15
16 1 16
17 2 17
18 3 18
19 4 19
20 5 20
As you can see, the values of the first column (a) are repeated.
How can I create "columns" with different numbers of values?
Thanks
H
Use a list. A data.frame is a special kind of list in which all elements are of the same length.
list(a=c(1:5),b=c(1:20))
$a
[1] 1 2 3 4 5
$b
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
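A short sketch of working with such a list, and of one way to get back to a data frame if you really need one: pad the shorter elements with NA so that all columns have the same length. The names `cols` and `padded` are only illustrative.

```r
# Elements of a list may have different lengths
cols <- list(a = 1:5, b = 1:20)
lengths(cols)        # length of each element

# Optional: pad shorter elements with NA to build a rectangular data frame
n <- max(lengths(cols))
padded <- as.data.frame(lapply(cols, function(v) {
  length(v) <- n     # extending a vector fills the new positions with NA
  v
}))
tail(padded, 3)
```

Whether NA padding is appropriate depends on what the rows are supposed to mean; if the columns are unrelated, keeping them in a list is usually the cleaner choice.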
