Referencing different columns as ranges between two data frames - r

I have one data frame/list that gives an ID and a number:
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of items contained in each, e.g. bricks in a wall: a list of walls with the number of bricks in each.
I have a second list that contains every object referred to in the first list, with a second attribute for each.
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average in the second list; as an end result I would like a list of walls with the average brick weight per wall,
i.e. average the attributes of rows 1-25, 26-61 ... 90-101.
My idea so far was to create a data frame with two columns
1. 1 25
2. 26 61
3. n
4. n
5. 90 101
and then create a third column that uses the first two as x and y in a mean(table2$column1[x:y]) type formula, but I can't get anything to work.
The end result would probably look something like this
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?

You could do something like this: set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
df1 <- data.frame(id = c(1, 2, 3, 4, 5), no = c(25, 36, 10, 18, 12))
df2 <- data.frame(obj = 1:100, att = sample(1:10, 100, replace = TRUE))  # random example data
df1$low <- cumsum(c(1, df1$no[-nrow(df1)]))   # first row of each range
df1$high <- pmin(cumsum(df1$no), nrow(df2))   # last row, capped because df2 has only 100 rows
df1$meanatt <- mapply(function(l, h) mean(df2$att[l:h]), df1$low, df1$high)
df1
  id no low high  meanatt
1  1 25   1   25 4.760000
2  2 36  26   61 5.527778
3  3 10  62   71 5.800000
4  4 18  72   89 5.111111
5  5 12  90  100 4.454545
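As an alternative sketch (not the approach above), you could label every row of df2 with the wall it belongs to and average by label. This assumes df2 has exactly sum(df1$no) rows, stored in wall order, and that the ids are sorted; the data and the seed below are made up for illustration.

```r
# Hypothetical data: the brick counts from the question, with df2 sized to match
df1 <- data.frame(id = 1:5, no = c(25, 36, 10, 18, 12))
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df2 <- data.frame(obj = seq_len(sum(df1$no)),
                  att = sample(1:10, sum(df1$no), replace = TRUE))

# Label every row of df2 with its wall: 25 ones, then 36 twos, and so on
wall <- rep(df1$id, times = df1$no)

# Average the attribute within each wall (groups come back in sorted id order)
df1$meanatt <- as.vector(tapply(df2$att, wall, mean))
df1
```

This avoids building the low/high columns, at the cost of one label per row of df2.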

Related

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in each row are the distances between these two sets of points (the distance from each point on list 1 to every point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns. The first column would be like the first column of the matrix (the 110,357,407 point IDs), the second would identify which of the 41 points the distance is measured to, and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between every back point and the first pres point are listed, pres changes to 2, and so on up to 41.)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
                          row = rep(rownames(output3), ncol(output3)),
                          value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution, though, since the vector of distances has about 4.5 billion elements (110,357,407 rows × 41 columns). You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
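A rough sketch of that loop, using the small 5 x 3 example from the question in place of the real data. The names df and out_file and the one-column-per-pass chunking are assumptions for illustration; on the full data you would probably also read the input in pieces.

```r
# Small stand-in for the real data: column 1 is the point id, the rest are distances
df <- data.frame(V1 = 1:5,
                 V2 = c(3427, 3432, 3486, 3449, 3483),
                 V3 = c(3444, 3486, 3479, 3438, 3486),
                 V4 = c(3451, 3476, 3486, 3484, 3486))

out_file <- tempfile(fileext = ".csv")  # hypothetical output path

# Handle one distance column per pass so only one long-format piece is in memory
for (j in seq_len(ncol(df) - 1)) {
  piece <- data.frame(Back = df[[1]], Pres = j, Dist = df[[j + 1]])
  first <- (j == 1)
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = first,   # write the header only once
              append = !first)     # later pieces are appended to the file
}

long <- read.csv(out_file)  # read back only to check the small example
head(long)
```

Each pass writes nrow(df) rows, so the file ends up in the Back/Pres/Dist order asked for without ever holding the whole long table at once.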

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value across the different rows (situations) of an individual.
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
and store each re-sampled data set as a data frame.
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid=1, and the next six rows, with pid=2, are the observations for the second person.
This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but will deal with duplicated individuals. With %in%, a pid that is drawn more than once still contributes its rows only once, so the code above gives dataframes of varying size. replicate has its obvious use of performing a given operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call("rbind", ...) and lapply handles the duplicates: this code generates a dataframe for each sampled pid and then sticks them back together within each iteration of replicate.
z <- replicate(4,
               do.call("rbind", lapply(sample(unique(d$pid), 3, replace = TRUE),
                                       function(x) d[d$pid == x, ])),
               simplify = FALSE)
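If you later need to tell apart two copies of the same individual inside one re-sample, a variation on the code above is to tag each draw with a new id. This is only a sketch: resample_panel and boot_id are made-up names, and the toy d just mirrors the layout shown in the question.

```r
# Small panel in the shape shown in the question
d <- data.frame(pid = rep(1:2, each = 6),
                x = seq(10, 32, by = 2),
                y = 2:13,
                z = seq(-5, 0.5, by = 0.5))

# One re-sample: draw individuals with replacement, keep all of their rows,
# and number the draws so repeated individuals stay distinguishable
resample_panel <- function(d, id = "pid") {
  ids <- unique(d[[id]])
  draw <- ids[sample.int(length(ids), length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(draw), function(k) {
    rows <- d[d[[id]] == draw[k], , drop = FALSE]
    rows$boot_id <- k
    rows
  })
  do.call(rbind, pieces)
}

set.seed(42)  # arbitrary seed for repeatability
z <- replicate(4, resample_panel(d), simplify = FALSE)
```

Indexing with sample.int rather than calling sample on the ids directly sidesteps the case where a single numeric id would be treated as a range.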

Deleting certain rows from a data frame

The data is confidential so here is a dummy data frame for example.
  i1  i2 o1
1 72 3.1 69
2 12 1.1 46
3 16 2.0 37
4 16 7.9 70
5 24 7.0 27
6 12 9.9 49
I want to divide this data frame into 3 data frames of fixed sizes, and the rows must be selected without replacement. Say I select a random part of it using:
x=sample(6,3);
df_part1=df[x,]
The rows selected at random are:
i1 i2 o1
4 16 2.0 37
6 12 9.9 49
1 72 3.1 69
Now, before I select the 2nd part, I want to delete these specific rows from the data frame. How do I go about it?
It sounds like you actually want to split your dataframe, not delete rows from it. If the three extracted dataframes should be random samples of roughly equal size, specify something like:
split(df, sample(1:3, nrow(df), replace = TRUE))
to get a list of the three sampled, mutually exclusive dataframes. No need to delete anything from the original dataframe. Note that the group sizes are themselves random here, not fixed.
Also, if you want the dataframes to have different expected sizes, you can specify a prob argument in sample.
You could shuffle 1:6 first and then take the row indices for each part from the shuffled numbers:
tmp <- sample(6, 6)
tmp[1:3] and tmp[4:6] then give you the row indices of two non-overlapping parts, e.g. df[tmp[1:3], ], and you could go from there. I hope this helps.
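For what it's worth, here is a small sketch of both readings of the question: dropping the rows already drawn with negative indexing, and splitting into parts of fixed size by shuffling group labels. The sizes vector and the names df_rest, grp and parts are assumptions for illustration.

```r
# Dummy data frame from the question
df <- data.frame(i1 = c(72, 12, 16, 16, 24, 12),
                 i2 = c(3.1, 1.1, 2.0, 7.9, 7.0, 9.9),
                 o1 = c(69, 46, 37, 70, 27, 49))

# Negative indexing removes the rows that were already drawn
x <- sample(nrow(df), 3)
df_part1 <- df[x, ]
df_rest  <- df[-x, ]

# Fixed-size split: shuffle a vector of group labels with the wanted sizes
sizes <- c(2, 2, 2)   # assumed part sizes; they must sum to nrow(df)
grp <- sample(rep(seq_along(sizes), times = sizes))
parts <- split(df, grp)
```

Because the labels are only permuted, each part has exactly the size requested, and no row can land in two parts.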

Merging Two Datasets, Removing recurring columns, Adding a new column

I have two datasets :
Here's the 1st, and Here's the 2nd.
My aim is to merge these data sets, removing the 1st or 2nd "JN" column since it's repeated, and find the ratio of "Freq" between them.
For each row, I want to use this calculation:
=(100)-(100*(FreqBL/FreqB))
and place the result in a 4th column.
The new data should look like:
JN FreqBL FreqB Success Ratio
4 10 33 69.6969
But I don't know how to select all rows separately, or the code needed for the calculation.
Thanks
You want to merge the data sets. Next time, I recommend that you provide a small reproducible example.
> new.dt <- merge(dt1, dt2)
> new.dt$"Success ratio" <- with(new.dt, 100-(100 * (FreqBL/FreqB)))
> head(new.dt)
  JN FreqB FreqBL Success ratio
1  4    33     10      69.69697
2  8    49     10      79.59184
3 10    44     13      70.45455
4 11    38      7      81.57895
5 13    29      3      89.65517
6 17    15     10      33.33333
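Since the original data sets are not shown, here is a self-contained sketch with stand-in inputs rebuilt from the first three rows of the output above. The contents of dt1 and dt2 and the column name SuccessRatio are assumptions.

```r
# Hypothetical inputs, rebuilt from the first rows of the output above
dt1 <- data.frame(JN = c(4, 8, 10), FreqB = c(33, 49, 44))
dt2 <- data.frame(JN = c(4, 8, 10), FreqBL = c(10, 10, 13))

# merge() joins on the shared JN column, so JN appears only once in the result
new.dt <- merge(dt1, dt2, by = "JN")

# The calculation is vectorised, so it is applied to every row at once
new.dt$SuccessRatio <- 100 - 100 * new.dt$FreqBL / new.dt$FreqB
new.dt
```

Naming the key with by = "JN" makes the join explicit; a column name without a space is also easier to refer to later than "Success ratio".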

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
    x  y total  w
1   1 21    33 11
2   2 22    36 12
3   3 23    39 13
4   4 24    42 14
5   5 25    45 15
6   6 26    48 16
7   7 27    51 17
8   8 28    54 18
9   9 29    57 19
10 10 30    60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total as a separate vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
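A minimal sketch of both ideas with the sample data from the question: compute total only once all other columns exist, and, if total is already there, move it to the end with setdiff. Using rowSums over named columns is my choice here, not something from the answers above.

```r
x <- 1:10
y <- 21:30
z <- data.frame(x, y)
z$w <- 11:20   # add as many columns as needed first

# Compute the total once, at the end, so it lands in the rightmost position
z$total <- rowSums(z[, c("x", "y", "w")])

# If total already exists somewhere in the middle, this moves it to the end
# without having to name the other columns
z <- z[, c(setdiff(names(z), "total"), "total")]
```

The setdiff line is safe to rerun after every new column, since it keeps the other columns in their current order.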
