Creating a series of vectors from a vector - r

I have a simple two vector dataframe (length=30) that looks something like this:
> mDF
Param1 w.IL.L
1 AuZgFw 0.5
2 AuZfFw 2
3 AuZgVw 74.3
4 AuZfVw 20.52
5 AuTgIL 80.9
6 AuTfIL 193.3
7 AuCgFL 0.2
8 ...
I'd like to use each of the rows to form 30 single value numeric vectors with the name of the vector taken from mDF$Param1, so that:
> AuZgFw
[1] 0.5
etc
I've tried melting and casting, but I suspect there may be an easier way?

The simplest/shortest way is to apply assign over rows:
mDF <- read.table(textConnection("
Param1 w.IL.L
1 AuZgFw 0.5
2 AuZfFw 2
3 AuZgVw 74.3
4 AuZfVw 20.52
5 AuTgIL 80.9
6 AuTfIL 193.3
7 AuCgFL 0.2
"),header=T,stringsAsFactors=F)
invisible(apply(mDF,1,function(x)assign(x[[1]],as.numeric(x[[2]]),envir = .GlobalEnv)))
This involves converting the second column of the data frame to and from a string, because apply first turns the data frame into a character matrix. invisible is there only to suppress the output of apply.
EDIT: You can also use mapply to avoid coercion to/from strings:
invisible(mapply(function(x,y)assign(x,y,envir=.GlobalEnv),mDF$Param1,mDF$w.IL.L))
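If separate global variables are not strictly required, a named vector or a dedicated environment may be easier to manage than 30 calls to assign. The following is only a sketch under that assumption; the shortened mDF and the names vals and env are made up for illustration.

```r
# Shortened copy of the example data (first three rows only)
mDF <- data.frame(
  Param1 = c("AuZgFw", "AuZfFw", "AuZgVw"),
  w.IL.L = c(0.5, 2, 74.3),
  stringsAsFactors = FALSE
)

# Option 1: one named numeric vector; look values up by name
vals <- setNames(mDF$w.IL.L, mDF$Param1)
vals[["AuZgFw"]]

# Option 2: one variable per row, but kept in its own environment
# instead of cluttering the global workspace
env <- new.env()
list2env(as.list(vals), envir = env)
get("AuZgVw", envir = env)
```

Passing envir = .GlobalEnv to list2env instead would create the variables in the global workspace, as in the answers above.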


R: subset data.frame by another vector

I have a dataframe with 241 rows. It is called master and it looks like this:
Patient Sample PDMax FileName
1 1.1 6 GSM1
1 1.2 6 GSM2
2 2.1 8 GSM3
3 3.1 5 GSM4
3 3.2 7 GSM5
Now I have a vector called Biopsy with the important samples. I would like to subset the master dataframe so that only the important information is left.
This is the vector biopsy:
1.2 2.1 3.2
The result should be like this:
Patient Sample PDMax FileName
1 1.2 6 GSM2
2 2.1 8 GSM3
3 3.2 7 GSM5
How can I do that? I tried different things like merge() or subset(), but everything failed.
Thanks!
Have a look at the data wrangling verbs inside dplyr. Hadley Wickham's book is a great place to start (http://r4ds.had.co.nz/transform.html#filter-rows-with-filter)
library(dplyr)
master %>% filter(Sample %in% Biopsy)
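The same filter can be written in base R without loading any package. This is a sketch that rebuilds the five example rows by hand and assumes Sample and Biopsy have the same type (both numeric here); if one of them were read in as character, the %in% match would need a conversion first.

```r
# The example rows from the question
master <- data.frame(
  Patient  = c(1, 1, 2, 3, 3),
  Sample   = c(1.1, 1.2, 2.1, 3.1, 3.2),
  PDMax    = c(6, 6, 8, 5, 7),
  FileName = c("GSM1", "GSM2", "GSM3", "GSM4", "GSM5"),
  stringsAsFactors = FALSE
)
Biopsy <- c(1.2, 2.1, 3.2)

# Keep only the rows whose Sample appears in Biopsy
result <- master[master$Sample %in% Biopsy, ]
result
```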

R - create new vectors based on elements of existing vector

Thanks in advance for your help. I am very new to R and am having some trouble with code that, to me, looks like it should work but doesn't. I have a data frame like the one below:
studentID classNumber classRating
7 1 4
7 2 4
7 4 3
79 1 5
79 2 3
116 1 5
116 2 4
134 1 5
134 3 5
134 4 5
And I want it to read like this:
Student ID class1 class2 class3 class4
7 4 4 NA 3
79 5 3 NA NA
116 5 4 NA NA
134 5 NA 5 5
I've tried to piece together different things that I've come across and it seemed like the best approach was to create a new data frame and matrix and then populate it from the current data frame. I came up with the broken code below:
classRatings = data.frame(matrix(NA,4,5))
for(i in 1:nrow(classDB)){
#Find ratings by each student
rowsToReplace = classDB$studentID==classRatings$studentID[i]
#Make a row for each unique studentID in classRatings
classDB$studentID[rowsToReplace] = classRatings$studentID[i]
#for each studentID, find put the given rating for each unique class into
#it's own vector
for(j in classDB$classNumber){
if(classDB$classNumber==1){classRatings$class1==classDB$classRating}[j]
if(classDB$classNumber==2){classRatings$class2==classDB$classRating}[j]
if(classDB$classNumber==3){classRatings$class3==classDB$classRating}[j]
if(classDB$classNumber==4){classRatings$class4==classDB$classRating}[j]
if(classDB$classNumber==5){classRatings$class5==classDB$classRating}[j]
}
}
I'm getting an error that says:
the condition has length > 1 and only the first element will be used
and I am beyond my skill level to figure it out. Any help is appreciated.
The tidyr package can spread this long table into a wider one:
library(tidyr)
spread(classDB,classNumber,classRating,fill=NA)
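For reference, the message in the question appears because if expects a single TRUE or FALSE, while classDB$classNumber==1 is a whole vector. If installing tidyr is not an option, base R's reshape can do the same long-to-wide conversion. This is a sketch on the example data; the resulting column names (classRating.1 and so on) come from reshape, not from the question.

```r
# The example data from the question
classDB <- data.frame(
  studentID   = c(7, 7, 7, 79, 79, 116, 116, 134, 134, 134),
  classNumber = c(1, 2, 4, 1, 2, 1, 2, 1, 3, 4),
  classRating = c(4, 4, 3, 5, 3, 5, 4, 5, 5, 5)
)

# One row per studentID, one column per classNumber; missing pairs become NA
wide <- reshape(classDB, idvar = "studentID", timevar = "classNumber",
                direction = "wide")

# Put the class columns in numeric order
wide <- wide[, c("studentID", paste0("classRating.", 1:4))]
wide
```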

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected Results :
The table will gain more columns later: the expected result only goes up to day1sales.4, but after running more data it will extend to day1sales.6 and so on. So how can I compute the average for each row?
Use rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. Also the function apply will apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
This adds an AverageSales column to the data frame f with the value that you want. (I changed 6 to length(f) since you say you may add more columns.)
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
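One thing to watch with both approaches: once an average column has been added, rerunning f[-1] or 2:length(f) would include that column in the next average. A possible way around this, sketched here on the assumption that all sales columns start with "day1sales", is to select them by name.

```r
# Example data with the column names R generates for the duplicated names
f <- data.frame(
  name        = c("apple", "orange", "banana"),
  day1sales   = c(2, 5, 4),
  day1sales.1 = c(2, 8, 6),
  day1sales.2 = c(2, 15, 24),
  day1sales.3 = c(22, 51, 13),
  day1sales.4 = c(5, 8, 7)
)

# Pick the sales columns by name so the average column is never included
sales_cols <- grep("^day1sales", names(f))
f$AverageSales <- rowMeans(f[sales_cols])

# Safe to rerun: the result does not change
f$AverageSales <- rowMeans(f[grep("^day1sales", names(f))])
f
```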

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates for two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance of each point on list 1 to every single point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns, the first column would be similar to the first column of the matrix with the 110,357,407 rows, the second would be the 41 data points (each matched up with a distance each of the first points to all of the others) and the third would be the distance between those points. So it would look something like this
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
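As a small check of the approach, here is a sketch applied to the 5 x 3 example from the question. The column names P1 to P3 and the output names Back, Pres and Dist are chosen for illustration only.

```r
# The small distance matrix from the question's edit
df <- data.frame(
  Back = 1:5,
  P1 = c(3427, 3432, 3486, 3449, 3483),
  P2 = c(3444, 3486, 3479, 3438, 3486),
  P3 = c(3451, 3476, 3486, 3484, 3486)
)

# unlist() stacks the distance columns one after another (column by column)
d <- unlist(df[-1], use.names = FALSE)

# expand.grid() varies its first argument fastest, which matches that order
newdf <- cbind(
  expand.grid(Back = seq_len(nrow(df)), Pres = seq_len(ncol(df) - 1)),
  Dist = d
)
newdf
```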
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
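A piecewise version might look like the sketch below, which handles one distance column per iteration and appends it to a CSV file. It uses the small example from the question; out_file is a temporary file standing in for the real output path, and for the full data the input would also have to be read in pieces rather than held in memory as df.

```r
# Small stand-in for the real data (column names P1..P3 are illustrative)
df <- data.frame(
  Back = 1:5,
  P1 = c(3427, 3432, 3486, 3449, 3483),
  P2 = c(3444, 3486, 3479, 3438, 3486),
  P3 = c(3451, 3476, 3486, 3484, 3486)
)

out_file <- tempfile(fileext = ".csv")  # placeholder output path

for (j in seq_len(ncol(df) - 1)) {
  # Long-format piece for the j-th "pres" point only
  piece <- data.frame(Back = df[[1]], Pres = j, Dist = df[[j + 1]])
  # Write the header once, then append the remaining pieces
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = (j == 1), append = (j > 1))
}

# Read back only to inspect the small example
result <- read.csv(out_file)
result
```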

loop over columns with semi like columnnames

I have the following variable and dataframe
welltypes <- c("LC","HC")
qccast <- data.frame(
LC_mean=1:10,
HC_mean=10:1,
BC_mean=rep(0,10)
)
Now I only want to see the welltypes I selected (in this case LC and HC, but it could also be different ones).
for(i in 1:length(welltypes)){
qccast$welltypes[i]_mean
}
This does not work, I know.
But how do I loop over those columns?
It has to be driven by the variable, because welltypes is of an unknown size.
The second argument to $ needs to be a column name of the first argument. I haven't run the code, but I would expect welltypes[i]_mean to be a syntax error. $ is similar to [[, so you can use paste to create the column name string and subset via [[.
For example:
qccast[[paste(welltypes[i],"_mean",sep="")]]
Depending on the rest of your code, you may be able to do something like this instead.
for(i in paste(welltypes,"_mean",sep="")){
print(qccast[[i]]) # print() is needed: values are not auto-printed inside a for loop
}
Here's another strategy:
qccast[ sapply(welltypes, grep, names(qccast)) ]
LC_mean HC_mean
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
Another easy way to access the given welltypes:
qccast[,paste(welltypes, '_mean', sep = "")]
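Putting the pieces together, a sketch of a version that tolerates welltypes with no matching column could look like this. The intersect step and the per-column mean are additions for illustration and are not part of the answers above.

```r
welltypes <- c("LC", "HC")
qccast <- data.frame(
  LC_mean = 1:10,
  HC_mean = 10:1,
  BC_mean = rep(0, 10)
)

# Build the column names, then keep only those that actually exist
cols <- paste0(welltypes, "_mean")
cols <- intersect(cols, names(qccast))

# drop = FALSE keeps a data frame even if only one column is selected
selected <- qccast[, cols, drop = FALSE]

# Loop over the selected columns, here computing one mean per column
col_means <- vapply(cols, function(nm) mean(qccast[[nm]]), numeric(1))
col_means
```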
