R: subset data.frame by another vector - r

I have a dataframe with 241 rows. It is called master and it looks like this:
Patient Sample PDMax FileName
1 1.1 6 GSM1
1 1.2 6 GSM2
2 2.1 8 GSM3
3 3.1 5 GSM4
3 3.2 7 GSM5
Now I have a vector called Biopsy containing the important samples. I would like to subset the master dataframe so that only the rows for those samples are left.
This is the vector Biopsy:
1.2 2.1 3.2
The result should be like this:
Patient Sample PDMax FileName
1 1.2 6 GSM2
2 2.1 8 GSM3
3 3.2 7 GSM5
How can I do that? I tried different things like merge() or subset(), but everything failed.
Thanks!

Have a look at the data wrangling verbs in dplyr. Hadley Wickham's book is a great place to start (http://r4ds.had.co.nz/transform.html#filter-rows-with-filter). With filter() and %in% you can keep only the rows whose Sample value appears in Biopsy:
library(dplyr)
master %>% filter(Sample %in% Biopsy)
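The same row selection can also be written without dplyr. The following is a minimal base R sketch; the data frame is rebuilt by hand from the five example rows in the question (the real master has 241 rows), and the name result is only illustrative.

```r
# Rebuild the example rows from the question (illustrative subset of master)
master <- data.frame(
  Patient  = c(1, 1, 2, 3, 3),
  Sample   = c(1.1, 1.2, 2.1, 3.1, 3.2),
  PDMax    = c(6, 6, 8, 5, 7),
  FileName = c("GSM1", "GSM2", "GSM3", "GSM4", "GSM5"),
  stringsAsFactors = FALSE
)
Biopsy <- c(1.2, 2.1, 3.2)

# Logical indexing: TRUE for every row whose Sample occurs in Biopsy
result <- master[master$Sample %in% Biopsy, ]
print(result)
```

If Sample and Biopsy are stored with different types (for example one as character and one as numeric), it may be safer to convert both to character before comparing.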

Related

How to rank data from multiple rows and columns?

Example data:
>data.frame("A" = c(20,40,53), "B" = c(40,11,60))
What's the easiest way in R to get from this
A B
1 20 40
2 40 11
3 53 60
to this?
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
I couldn't find a way to make rank() or frank() work across multiple rows and columns at once. Googling things like "r rank dataframe" or "r rank multiple rows" only yielded questions on how to rank rows or columns individually, which is odd, as I suspect this question must have been answered before.
Try rank as shown below:
df[] <- rank(df)
or
df <- list2DF(relist(rank(df),skeleton = unclass(df)))
and you will get
> df
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
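To make the flattening step explicit, here is a small variant of the first approach that calls unlist() before rank(). This is a sketch on the example data only; it relies on the default ties.method = "average", which is why the two values of 40 both receive rank 3.5.

```r
df <- data.frame(A = c(20, 40, 53), B = c(40, 11, 60))

# Flatten all cells column by column, rank them jointly,
# then write the ranks back into the same shape
ranked <- df
ranked[] <- rank(unlist(df))
print(ranked)
```

Assigning into ranked[] keeps the column names and dimensions of the original data frame while replacing every cell.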

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal sounds simple but turns out to be a bit tricky, and there are probably several ways to solve it.
I want to apply a function "func" to the rows of the data.frame, grouped by a factor, e.g. the factor "Lot". This is done via
m_dist_lot<- by(data.frame, data.frame$Lot,func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
Within each block, the first line seems to give the row numbers of the initial data.frame from which the data were taken. The second line contains the calculated values.
My problem now is: how can I store these values in the correct rows of the original data.frame?
For example in case of one certain calculation/row of the data frame:
m_dist_lot<- by(data.frame, data.frame$Lot,func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best try actually is to use "unlist".
un<- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how I can separate the "factor.row" information from the calculated value in such a way that the information is stored correctly in the data frame.
At least when using un<- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I lack the information of how to assign these values properly into the data.frame.
Using un<- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors=FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper mapping between the calculated values and the rows of the data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
I am going to use this result to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function which takes any number of columns as arguments (which are then passed in as vectors) and returns a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did.
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
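If the per-group function needs the whole data.frame of each group, a base R alternative is to pair split() with unsplit(), which puts the results back in the original row order. The sketch below is only an illustration: dat is a made-up five-row stand-in for the asker's data, and func is a hypothetical placeholder (absolute deviation from the group mean), since the real func was not shown.

```r
# Made-up stand-in for the asker's data (values taken from the question's rows)
dat <- data.frame(
  Lot     = c(8, 8, 9, 9, 7),
  Voltage = c(356.119, 355.249, 346.921, 353.335, 358.647)
)

# Hypothetical placeholder for `func`: takes one group's data.frame and
# returns one unnamed value per row of that group
func <- function(d) abs(d$Voltage - mean(d$Voltage))

# split() makes one data.frame per Lot, lapply() runs func on each,
# and unsplit() reassembles the values in the original row order
dat$m_dist_lot <- unsplit(lapply(split(dat, dat$Lot), func), dat$Lot)
print(dat)
```

Because unsplit() uses the same grouping vector as split(), no row names have to be parsed to find out where each value belongs.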
I found a workaround with the help of the question "Force gsub to keep trailing zeros":
un<- do.call(rbind, lapply(list, data.frame, stringsAsFactors=FALSE))
un<- gsub(".*.","", un)
un<- regmatches(un, gregexpr("(?<=.).*", un, perl=TRUE))
rows<- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows)<- c("row_number")
rows["row_number"]<- sprintf("%s", rownames(un))
rows["row_number"]<- as.numeric(un[,1])
rows["row_number"]<- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))

R: repeat series of numbers within groups a number of times that differs among groups

I have a data frame that looks something like the one below, which I'll call data frame 1. There is no regular pattern to the number of rows associated with each number in the “tank” column (or the other columns for that matter).
#code for making data frame 1
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
df1<-data.frame(cbind(tank,size,mass))
I need to repeat the sequence of values found in the "size" and "mass" columns within each tank. However, the number of repeats for each tank's sequence will differ (again in no particular pattern). I have another data frame (data frame 2) that contains the number of repeats for each tank's sequence, and it looks something like this:
#code for making data frame 2
tank<-c(1,2,3,4)
rpeat<-c(3,1,2,2)
df2<-data.frame(cbind(tank,rpeat))
Ultimately, my goal is to have a data frame like this (see below). Each series of values within a tank is repeated a number of times equal to that specified in data frame 2.
#code for making data frame 3
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size<-c(2.1,3.5,2.1,3.5,2.1,3.5,2.3,4.0,3.3,2.2,4.0,3.3,2.2,1.9,3.0,1.9,3.0)
mass<-c(6.5,5.5,6.5,5.5,6.5,5.5,5.9,7.2,4.9,8.0,7.2,4.9,8.0,9.1,6.3,9.1,6.3)
df3<-data.frame(cbind(tank,size,mass))
I have figured out a somewhat crude way to do this when each number in the size and mass columns is just repeated a specified number of times (see below) but not how to create the repeating series that I need.
#code to make data frame 4
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size2<-c(2.1,2.1,2.1,3.5,3.5,3.5,2.3,4.0,4.0,3.3,3.3,2.2,2.2,1.9,1.9,3.0,3.0)
mass2<-c(6.5,6.5,6.5,5.5,5.5,5.5,5.9,7.2,7.2,4.9,4.9,8.0,8.0,9.1,9.1,6.3,6.3)
df4<-data.frame(cbind(tank,size2,mass2))
To make the above data frame, I took the data frame below, which combines data frames 1 and 2, and applied the code below.
#code to produce data frame 5
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
rpeat<-c(3,3,1,2,2,2,2,2)
df5<-data.frame(cbind(tank,size,mass,rpeat))
#code to produce data frame 4 from data frame 5
tank_col <- rep(df5$tank, times = df5$rpeat)
size_col <- rep(df5$size, times = df5$rpeat)
mass_col <- rep(df5$mass, times = df5$rpeat)
goal <-data.frame(cbind(tank_col,size_col,mass_col))
Sorry this is so long, but I have a hard time explaining what I need to do without providing examples. Thanks in advance for any help you can provide.
You can use data.table and join the two tables on tank:
library(data.table)
# create df1 and df2 as data.tables keyed by tank
DT1 <- data.table(df1, key = 'tank')
DT2 <- data.table(df2, key = 'tank')
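# you can now join on tank, and repeat all columns in
# .SD (the subset of the data.table) once per row of DT2.
# Note: data.table >= 1.9.4 needs by = .EACHI for this per-row grouping;
# older versions did it implicitly ("by-without-by").
DT1[DT2, lapply(.SD, rep, times = rpeat), by = .EACHI]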
# 1: 1 2.1 6.5
# 2: 1 3.5 5.5
# 3: 1 2.1 6.5
# 4: 1 3.5 5.5
# 5: 1 2.1 6.5
# 6: 1 3.5 5.5
# 7: 2 2.3 5.9
# 8: 3 4.0 7.2
# 9: 3 3.3 4.9
# 10: 3 2.2 8.0
# 11: 3 4.0 7.2
# 12: 3 3.3 4.9
# 13: 3 2.2 8.0
# 14: 4 1.9 9.1
# 15: 4 3.0 6.3
# 16: 4 1.9 9.1
# 17: 4 3.0 6.3
Read the vignettes associated with data.table to get a full understanding of what is going on.
What we are doing is called by-without-by within the vignettes.
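For comparison, the same result can be built in base R by repeating blocks of row indices rather than the values themselves. This is a sketch under the assumption that every tank in df1 also appears in df2; the helper names idx_by_tank, reps and idx are illustrative only.

```r
df1 <- data.frame(
  tank = c(1, 1, 2, 3, 3, 3, 4, 4),
  size = c(2.1, 3.5, 2.3, 4.0, 3.3, 2.2, 1.9, 3.0),
  mass = c(6.5, 5.5, 5.9, 7.2, 4.9, 8.0, 9.1, 6.3)
)
df2 <- data.frame(tank = c(1, 2, 3, 4), rpeat = c(3, 1, 2, 2))

# Row indices of df1, grouped by tank
idx_by_tank <- split(seq_len(nrow(df1)), df1$tank)

# Look up the number of repeats for each group, in group order
reps <- df2$rpeat[match(names(idx_by_tank), df2$tank)]

# Repeat each tank's block of indices as a whole series, then index df1
idx <- unlist(Map(rep, idx_by_tank, times = reps), use.names = FALSE)
df3 <- df1[idx, ]
rownames(df3) <- NULL
print(df3)
```

Using rep() with times on a whole block repeats the series (2.1, 3.5, 2.1, 3.5, ...), whereas applying times element by element, as in the question's attempt, repeats each value in place.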

how to sort a column in a table in r

I tried to merge two tables, but the result is like this,
subj gamble_gamble n_gambles expected_value
1 19 32 1.7
10 3 4 1.5
100 3 4 1.5
101 6 32 1.4
102 3 4 1.5
103 19 32 1.7
The subj column isn't ordered in the usual numeric way (e.g. 1,2,3,4,5,6). I tried to order the subj column with this command:
newdata <- table3[order(subj),]
but it doesn't work. Can somebody help me?
Use this:
newdata <- table3[order(as.numeric(as.character(table3$subj))),]
This works even if subj is a factor (not just character).
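A small sketch of why the double conversion matters, assuming subj ended up as a factor after the merge (the question does not show its type, and the toy table3 below is made up for illustration). Calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed in the column, so as.character() has to come first.

```r
# Toy version of table3 with subj stored as a factor (assumption)
table3 <- data.frame(
  subj      = factor(c("1", "10", "100", "101", "2", "3")),
  n_gambles = c(32, 4, 4, 32, 4, 32)
)

# Level codes, not the subject numbers: this is the trap to avoid
codes <- as.numeric(table3$subj)

# Convert labels to text first, then to numbers, then order by them
subj_num <- as.numeric(as.character(table3$subj))
newdata  <- table3[order(subj_num), ]
print(newdata)
```

If subj is already a character vector, the as.character() call is harmless, so the same line covers both cases.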

Creating a series of vectors from a vector

I have a simple two-column data frame (30 rows) that looks something like this:
> mDF
Param1 w.IL.L
1 AuZgFw 0.5
2 AuZfFw 2
3 AuZgVw 74.3
4 AuZfVw 20.52
5 AuTgIL 80.9
6 AuTfIL 193.3
7 AuCgFL 0.2
8 ...
I'd like to use each of the rows to form 30 single value numeric vectors with the name of the vector taken from mDF$Param1, so that:
> AuZgFw
[1] 0.5
etc
I've tried melting and casting, but I suspect there may be an easier way?
The simplest/shortest way is to apply assign over rows:
mDF <- read.table(textConnection("
Param1 w.IL.L
1 AuZgFw 0.5
2 AuZfFw 2
3 AuZgVw 74.3
4 AuZfVw 20.52
5 AuTgIL 80.9
6 AuTfIL 193.3
7 AuCgFL 0.2
"),header=T,stringsAsFactors=F)
invisible(apply(mDF,1,function(x)assign(x[[1]],as.numeric(x[[2]]),envir = .GlobalEnv)))
This involves converting the second column of the data frame to and from a string, because apply first turns the data frame into a character matrix. invisible is there only to suppress the output of apply.
EDIT: You can also use mapply to avoid coercion to/from strings:
invisible(mapply(function(x,y)assign(x,y,envir=.GlobalEnv),mDF$Param1,mDF$w.IL.L))
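As an alternative to creating 30 separate global variables, the values can be kept in a named list and looked up by name. This is a different technique from the assign-based answers above, sketched here on the first three rows of the example; params and env are illustrative names.

```r
mDF <- data.frame(
  Param1 = c("AuZgFw", "AuZfFw", "AuZgVw"),
  w.IL.L = c(0.5, 2, 74.3),
  stringsAsFactors = FALSE
)

# One list element per row, named after Param1
params <- setNames(as.list(mDF$w.IL.L), mDF$Param1)
print(params$AuZgFw)

# If real variables are still wanted, copy the list into an environment
# (a fresh one here, so the global workspace is not cluttered)
env <- new.env()
list2env(params, envir = env)
print(get("AuZfFw", envir = env))
```

Passing envir = .GlobalEnv to list2env() instead would create the same top-level variables as the assign-based versions.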
