Selecting data.frame rows using a vector of factors - r

I have a dataframe like this:
> df
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000005 0 5 0 0 1 0 1 0 12 0
# ENSG00000000419 1843 1528 1520 1789 1144 1946 2017 2794 1455 2258
# ENSG00000000457 611 536 496 637 621 687 966 774 822 3026
# ENSG00000000460 453 493 884 1180 338 541 606 650 520 3479
# ENSG00000000938 249 296 995 113 1073 233 333 4441 2708 404
# ENSG00000000971 3570 1126 2431 1395 6452 7677 8222 1188 20762 4111
# ENSG00000001036 3774 1573 3323 1958 2029 2022 4236 1641 4195 1313
and want to select the following genes:
genes <- c("ENSG00000000003", "ENSG00000000460", "ENSG00000001084")
Why do I get an incorrect result when selecting the rows this way:
> df[factor(genes), ]
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000005 0 5 0 0 1 0 1 0 12 0
# ENSG00000000419 1843 1528 1520 1789 1144 1946 2017 2794 1455 2258
and the correct result this way?
> df[as.vector(genes), ]
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000460 453 493 884 1180 338 541 606 650 520 3479
# ENSG00000001084 3705 6465 1803 49162 2018 1161 4621 8359 3375 2678
The row names of df are strings, but in another data frame I have the same names stored as factors. To get correct results I have to wrap them in as.vector() every time.
Can you explain the logic behind the first result?

Factors are stored internally as integer codes. When you subset with a factor, R indexes by those codes rather than by the labels. Here factor(genes) has three distinct levels, so its codes are 1, 2 and 3, and you get the first 3 rows of your data frame. Check:
(1:10)[factor(genes)]
#[1] 1 2 3
So from the sequence 1:10 it returns the first 3 values.
The same happens with any data frame, whatever its row names are:
mtcars[factor(genes), ]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
If genes holds row names of your data frame, you can subset directly with the character vector (use as.character(genes) if it is stored as a factor):
df[genes, ]
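The difference between indexing by factor codes and indexing by name can be seen on a small made-up data frame. This is only a sketch: the row names g1 to g4 and the column x are invented for illustration and are not part of the question's data.

```r
# Toy data frame with character row names (names are illustrative only)
df <- data.frame(x = c(10, 20, 30, 40),
                 row.names = c("g1", "g2", "g3", "g4"))

# A factor holding two of the row names; levels are sorted: "g2", "g4"
genes <- factor(c("g4", "g2"))

# The integer codes behind the factor: "g4" is level 2, "g2" is level 1
codes <- as.integer(genes)

# Indexing with the factor uses the codes, so rows 2 and 1 come back
by_factor <- df[genes, , drop = FALSE]

# Converting to character first matches against the row names instead
by_name <- df[as.character(genes), , drop = FALSE]

print(by_factor)
print(by_name)
```

Converting with as.character() (or as.vector()) before subsetting is what makes the lookup go by name.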

Related

How can I reorder a variable having categorical values in dplyr [duplicate]

I have done some manipulations as below to arrive at the following dataframe:
df
cluster.kmeans variable max mean median min sd
1 1 MonthlySMS 191 90.32258 71.0 8 56.83801
2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
3 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
4 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
5 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
6 2 MonthlySMS 155 53.18815 57.0 1 31.64533
7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
8 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
9 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
10 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
11 3 MonthlySMS 215 135.60202 137.0 49 34.09794
12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
13 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
14 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
15 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
16 4 MonthlySMS 136 49.93980 54.5 1 31.47778
17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
18 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
19 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
20 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
I would like to order the variable column based on these strings:
top.vars_kmeans
[1] "ThreeMonthlySMS" "SixMonthlyData" "ThreeMonthlyData"
[4] "MonthlySMS" "SixMonthlySMS"
I could do it using sqldf as below:
library(sqldf)
a <- c(1,2,3,4,5)
a <- data.frame(top.vars_kmeans,a)
a <- sqldf('select a1.* ,b1.a from "MS.DATA.STATS.KMEANS" a1 inner join a b1
on a1.variable=b1."top.vars_kmeans"')
a <- sqldf('select * from a order by "cluster.kmeans",a')
a$a <- NULL
a
cluster.kmeans variable max mean median min sd
1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
I am curious whether this can be achieved using dplyr; it would improve my understanding of this package.
Any help is appreciated!
We can use arrange with match:
library(dplyr)
a %>%
  arrange(cluster.kmeans, match(variable, top.vars_kmeans))
# cluster.kmeans variable max mean median min sd
#1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
#2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
#3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
#4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
#5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
#6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
#7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
#8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
#9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
#10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
#11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
#12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
#13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
#14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
#15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
#16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
#17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
#18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
#19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
#20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
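To see why match works as a sort key, here is a base R sketch of the same idea on a cut-down version of the data. The object names df and top_vars and the three-variable subset are assumptions made for the example; order() stands in for arrange().

```r
# Cut-down data: two clusters, three variables each (values from the question)
df <- data.frame(
  cluster  = c(1, 1, 1, 2, 2, 2),
  variable = rep(c("MonthlySMS", "SixMonthlyData", "ThreeMonthlySMS"), 2),
  max      = c(191, 1085, 199, 155, 574, 149),
  stringsAsFactors = FALSE
)

# Desired order of the variable column
top_vars <- c("ThreeMonthlySMS", "SixMonthlyData", "MonthlySMS")

# match() returns each value's position in top_vars, which is a numeric sort key
key <- match(df$variable, top_vars)

# Sort by cluster first, then by the custom key
sorted <- df[order(df$cluster, key), ]
print(sorted)
```

Values of variable that do not appear in top_vars get a key of NA, which order() places last by default.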
You can redefine the column as a factor (or ordered factor) with the levels in the desired order (e.g. as stored in top.vars_kmeans):
a$variable <- factor(a$variable, levels = top.vars_kmeans)
Sorting on the column then follows the level order rather than alphabetical order. See also the help page online, or via ?factor.
If you want to order the whole data.frame, use arrange with match as in akrun's answer.
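The effect of setting the levels explicitly can be checked in isolation. The sketch below uses invented vector names (variable, wanted) and only three of the five categories.

```r
# Values in their original (alphabetical) order
variable <- c("MonthlySMS", "SixMonthlyData", "ThreeMonthlySMS")

# The order we actually want
wanted <- c("ThreeMonthlySMS", "SixMonthlyData", "MonthlySMS")

# Re-level: the integer codes now follow the position in `wanted`
f <- factor(variable, levels = wanted)
codes <- as.integer(f)

# Sorting a factor sorts by its codes, i.e. by level order
sorted <- as.character(sort(f))
print(sorted)
```

Because sorting uses the codes, any function that orders the column (order, sort, dplyr's arrange) picks up the custom order once the levels are set.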
You can try group_by and slice:
df %>% group_by(cluster.kmeans) %>% slice(match(top.vars_kmeans, variable))
# cluster.kmeans variable max mean median min sd
# (int) (fctr) (int) (dbl) (dbl) (int) (dbl)
#1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
#2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
#3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
#4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
#5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
#6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
#7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
#8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
#9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
#10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
#11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
#12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
#13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
#14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
#15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
#16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
#17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
#18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
#19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
#20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157

How can I call for something in a data.frame when the distinction has to be made in two columns?

Sorry for the very specific question, but I have a file as such:
Adj Year man mt wm wmt by bytl gr grtl
3 careless 1802 0 126 0 54 0 13 0 51
4 careless 1803 0 166 0 72 0 1 0 18
5 careless 1804 0 167 0 58 0 2 0 25
6 careless 1805 0 117 0 5 0 5 0 7
7 careless 1806 0 408 0 88 0 15 0 27
8 careless 1807 0 214 0 71 0 9 0 32
...
560 mean 1939 21 5988 8 1961 0 1152 0 1512
561 mean 1940 20 5810 6 1965 1 914 0 1444
562 mean 1941 10 6062 4 2097 5 964 0 1550
563 mean 1942 8 5352 2 1660 2 947 2 1506
564 mean 1943 14 5145 5 1614 1 878 4 1196
565 mean 1944 42 5630 6 1939 1 902 0 1583
566 mean 1945 17 6140 7 2192 4 1004 0 1906
Now I have to look up specific values (e.g. [careless, 1804, man] or [mean, 1944, wmt]).
I have no clue how to do that. One possibility would be to split the data.frame and create an array, if I'm correct, but I'd love to have a simpler solution.
Thank you in advance!
Subsetting on specific values in the Adj and Year columns and selecting the man column will give you the required output:
df[df$Adj == "careless" & df$Year == 1804, "man"]
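The same pattern covers both lookups from the question if it is wrapped in a small helper. This is a sketch only: lookup() is a hypothetical function name, and the data frame holds just four rows copied from the question's table.

```r
# Four rows taken from the question's table
df <- data.frame(
  Adj  = c("careless", "careless", "mean", "mean"),
  Year = c(1804, 1805, 1944, 1945),
  man  = c(0, 0, 42, 17),
  wmt  = c(58, 5, 1939, 2192),
  stringsAsFactors = FALSE
)

# Hypothetical helper: both conditions must hold for a row to be selected
lookup <- function(data, adj, year, column) {
  data[data$Adj == adj & data$Year == year, column]
}

careless_man <- lookup(df, "careless", 1804, "man")
mean_wmt     <- lookup(df, "mean", 1944, "wmt")
print(c(careless_man, mean_wmt))
```

The single & combines the two logical vectors element by element, so only rows matching both Adj and Year survive.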

How to dynamically select columns

Let's assume I ran a random forest model and got the variable importance info as below:
set.seed(121)
ImpMeasure <- data.frame(mod.varImp$importance)
ImpMeasure$Vars <- row.names(ImpMeasure)
ImpMeasure.df <- ImpMeasure[order(-ImpMeasure$Overall), ]
row.names(ImpMeasure.df) <- NULL
class(ImpMeasure.df)
ImpMeasure.df <- ImpMeasure.df[, c(2, 1)] # so now we have the variable importance info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var <- ImpMeasure.df[1:5, ] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select only these variables from the original dataset (given below) without spelling out the actual variable names, using the output of top.var instead? How do I use the dplyr select function for this?
My original dataset looks like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
# one_of() expects a character vector, so pass the Vars column of top.var
movies.imp <- moviesdf.cluster %>% select(one_of(top.var$Vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
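That did it!
Hadley provided an answer for this using the standard-evaluation variant of select:
select_(df, .dots = top.var)
Note that select_() is deprecated in current dplyr releases, where all_of() or any_of() inside select() takes a character vector of column names instead.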
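For comparison, selecting columns from a character vector needs no package at all. The following is a base R sketch with made-up, shortened data; movies, importance and top_vars are illustrative names, not the objects from the question.

```r
# Shortened stand-in for the original dataset
movies <- data.frame(
  num_voted_users = c(886204, 471220),
  duration        = c(178, 169),
  gross           = c(760505847, 309404152),
  cluster         = c(2, 3)
)

# Shortened stand-in for the importance table
importance <- data.frame(
  Vars    = c("duration", "num_voted_users", "gross"),
  Overall = c(11.89, 100, 27.65),
  stringsAsFactors = FALSE
)

# Names of the two most important variables, as a plain character vector
top_vars <- head(importance$Vars[order(-importance$Overall)], 2)

# Index the columns by name; extra columns can simply be appended
movies_imp <- movies[, c(top_vars, "cluster")]
print(movies_imp)
```

The key point is that the selection is driven by a character vector rather than a one-column data frame, which is also what the dplyr helpers expect.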

How to transform particular rows into columns in R

I'm new to R and my question might seem easy for most of you. I have data like this:
> data.frame(table(dat),total)
AGEintervals mytest.G_B_FLAG Freq total
1 (1,23] 0 5718 5912
2 (23,26] 0 5249 5579
3 (26,28] 0 3105 3314
4 (28,33] 0 6277 6693
5 (33,37] 0 4443 4682
6 (37,41] 0 4277 4514
7 (41,46] 0 4904 5169
8 (46,51] 0 4582 4812
9 (51,57] 0 4039 4236
10 (57,76] 0 3926 4031
11 (1,23] 1 194 5912
12 (23,26] 1 330 5579
13 (26,28] 1 209 3314
14 (28,33] 1 416 6693
15 (33,37] 1 239 4682
16 (37,41] 1 237 4514
17 (41,46] 1 265 5169
18 (46,51] 1 230 4812
19 (51,57] 1 197 4236
20 (57,76] 1 105 4031
As you might have noticed, the age intervals start repeating at row 11.
All I need is 10 rows, with the 0's and 1's in different columns, like this:
AGEintervals 1 0 total
1 (1,23] 194 5718 5912
2 (23,26] 330 5249 5579
3 (26,28] 209 3105 3314
4 (28,33] 416 6277 6693
5 (33,37] 239 4443 4682
6 (37,41] 237 4277 4514
7 (41,46] 265 4904 5169
8 (46,51] 230 4582 4812
9 (51,57] 197 4039 4236
10 (57,76] 105 3926 4031
Many thanks
This is a straightforward "long" to "wide" transformation that is easy to achieve with reshape from base R:
reshape(mydf, idvar = c("AGEintervals", "total"),
timevar = "mytest.G_B_FLAG", direction = "wide")
# AGEintervals total Freq.0 Freq.1
# 1 (1,23] 5912 5718 194
# 2 (23,26] 5579 5249 330
# 3 (26,28] 3314 3105 209
# 4 (28,33] 6693 6277 416
# 5 (33,37] 4682 4443 239
# 6 (37,41] 4514 4277 237
# 7 (41,46] 5169 4904 265
# 8 (46,51] 4812 4582 230
# 9 (51,57] 4236 4039 197
# 10 (57,76] 4031 3926 105
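A reduced, self-contained version of the reshape call may make the roles of idvar and timevar easier to follow. The data below keeps only the first two age intervals from the question, and the column name flag is a shortened stand-in for mytest.G_B_FLAG.

```r
# Long format: one row per (interval, flag) combination
mydf <- data.frame(
  AGEintervals = rep(c("(1,23]", "(23,26]"), times = 2),
  flag         = rep(c(0, 1), each = 2),
  Freq         = c(5718, 5249, 194, 330),
  total        = rep(c(5912, 5579), times = 2),
  stringsAsFactors = FALSE
)

# idvar: columns identifying a row in the wide result
# timevar: column whose values become suffixes of the new Freq columns
wide <- reshape(mydf,
                idvar = c("AGEintervals", "total"),
                timevar = "flag",
                direction = "wide")
print(wide)
```

Each distinct value of the timevar column yields one new column, named from the varying column plus the value (Freq.0 and Freq.1 here).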
Other alternatives include:
reshape2:
library(reshape2)
dcast(mydf, ... ~ mytest.G_B_FLAG, value.var = "Freq")
tidyr:
library(tidyr)
spread(mydf, mytest.G_B_FLAG, Freq)
Update
This problem is possibly avoidable in the first place.
Run the following example code and compare the output at each stage:
## Create some sample data
set.seed(1)
dat <- data.frame(V1 = sample(letters[1:3], 20, TRUE),
V2 = sample(c(0, 1), 20, TRUE))
## View the output
dat
## Look what happens when we use `data.frame` on a `table`
data.frame(table(dat))
## Compare it with `as.data.frame.matrix`
as.data.frame.matrix(table(dat))
## The total can be added automatically with `addmargins`
as.data.frame.matrix(addmargins(table(dat), 2, sum))
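The same comparison can be made with fixed rather than random data, so the result is predictable. The six rows below are invented for illustration; the calls are the ones used in the update above.

```r
# Small fixed data set instead of random sampling
dat <- data.frame(
  V1 = c("a", "a", "b", "b", "b", "c"),
  V2 = c(0, 1, 0, 0, 1, 1)
)

# Cross-tabulation kept in wide form: one row per V1, one column per V2 value
wide <- as.data.frame.matrix(table(dat))

# Same table with a row total appended as a third column
with_total <- as.data.frame.matrix(addmargins(table(dat), 2, sum))

print(wide)
print(with_total)
```

Keeping the table in matrix shape from the start means no long-to-wide step is needed afterwards.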

R: order data frame according to one column

I have data like this:
> bbT11
range X0 X1 total BR GDis BDis WOE IV Index
1 (1,23] 5718 194 5912 0.03281461 12.291488 8.009909 0.42822753 1.83348973 1.534535
2 (23,26] 5249 330 5579 0.05915039 11.283319 13.625103 -0.18858848 0.44163352 1.207544
3 (26,28] 3105 209 3314 0.06306578 6.674549 8.629232 -0.25685394 0.50206815 1.292856
4 (28,33] 6277 416 6693 0.06215449 13.493121 17.175888 -0.24132650 0.88874916 1.272937
5 (33,37] 4443 239 4682 0.05104656 9.550731 9.867878 -0.03266713 0.01036028 1.033207
6 (37,41] 4277 237 4514 0.05250332 9.193895 9.785301 -0.06234172 0.03686928 1.064326
7 (41,46] 4904 265 5169 0.05126717 10.541702 10.941371 -0.03721203 0.01487247 1.037913
8 (46,51] 4582 230 4812 0.04779717 9.849527 9.496284 0.03652287 0.01290145 1.037198
9 (51,57] 4039 197 4236 0.04650614 8.682287 8.133774 0.06526000 0.03579599 1.067437
10 (57,76] 3926 105 4031 0.02604813 8.439381 4.335260 0.66612734 2.73386708 1.946684
I need to add an additional column "Bin" that will show numbers from 1 to 10, depending on BR column being in descending order, so for example 10th row becomes first, then first row becomes second, etc.
Any help would be appreciated
A very straightforward way is to use one of the rank functions from "dplyr" (e.g. dense_rank, min_rank). Here, I've actually just used rank from base R, which assigns 1 to the smallest BR. I've deleted some columns below just for presentation purposes.
library(dplyr)
mydf %>% mutate(bin = rank(BR))
# range X0 X1 total BR ... Index bin
# 1 (1,23] 5718 194 5912 0.03281461 ... 1.534535 2
# 2 (23,26] 5249 330 5579 0.05915039 ... 1.207544 8
# 3 (26,28] 3105 209 3314 0.06306578 ... 1.292856 10
# 4 (28,33] 6277 416 6693 0.06215449 ... 1.272937 9
# 5 (33,37] 4443 239 4682 0.05104656 ... 1.033207 5
# 6 (37,41] 4277 237 4514 0.05250332 ... 1.064326 7
# 7 (41,46] 4904 265 5169 0.05126717 ... 1.037913 6
# 8 (46,51] 4582 230 4812 0.04779717 ... 1.037198 4
# 9 (51,57] 4039 197 4236 0.04650614 ... 1.067437 3
# 10 (57,76] 3926 105 4031 0.02604813 ... 1.946684 1
If you just want to reorder the rows, use arrange instead:
mydf %>% arrange(BR)
In base R, the same Bin column can be assigned directly with order:
bbT11$Bin[order(bbT11$BR)] <- seq_len(nrow(bbT11))
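That the rank approach and the order assignment agree can be checked on a few values. This sketch uses four rounded BR values (from rows 1, 2, 3 and 10 of the question) in a plain vector, and the variable names are illustrative.

```r
# Rounded BR values from rows 1, 2, 3 and 10 of the question
BR <- c(0.0328, 0.0592, 0.0631, 0.0260)

# Approach 1: rank gives each value its position in ascending order
bin_rank <- rank(BR)

# Approach 2: order lists the row indices from smallest to largest BR;
# assigning 1..n at those positions produces the same numbering
bin_order <- integer(length(BR))
bin_order[order(BR)] <- seq_along(BR)

# For the reverse numbering (largest BR gets 1), rank the negated values
bin_desc <- rank(-BR)

print(rbind(bin_rank, bin_order, bin_desc))
```

The two approaches differ only when BR contains ties: rank averages tied positions by default, while the order assignment gives them consecutive numbers.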
