Deleting certain rows from a data frame - r

The data is confidential so here is a dummy data frame for example.
i1 i2 o1
1 72 3.1 69
2 12 1.1 46
3 16 2.0 37
4 16 7.9 70
5 24 7.0 27
6 12 9.9 49
I want to divide this data frame into 3 data frames of fixed sizes, with the rows selected without replacement. Here, say I select a random part of it using:
x=sample(6,3);
df_part1=df[x,]
The rows selected at random are:
i1 i2 o1
4 16 2.0 37
6 12 9.9 49
1 72 3.1 69
Now, before I select the 2nd part, I want to delete these specific rows from the data frame. How do I go about it?

It sounds like you actually want to split your dataframe, not delete rows from it. If the dataframes are of equal sizes and you want the three extracted dataframes to be random samples, specify something like:
split(df, sample(1:3, nrow(df), replace = TRUE))
to get a list of the three sampled, mutually exclusive dataframes. No need to delete anything from the original dataframe.
Also, if you want to have the dataframes have different sizes, you can specify a prob argument in sample.

You could sample 1:6 first and then extract the information from the shuffled 6 numbers:
tmp <- sample(6, 6)
tmp[1:3] and tmp[4:6] will then give you non-overlapping sets of row indices, and you can go from there. I hope this helps.
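To make this concrete, here is a minimal sketch using the dummy data from the question, assuming three parts of two rows each; it also shows negative indexing, which is the literal answer to "how do I delete these rows":

```r
# Dummy data from the question
df <- data.frame(i1 = c(72, 12, 16, 16, 24, 12),
                 i2 = c(3.1, 1.1, 2.0, 7.9, 7.0, 9.9),
                 o1 = c(69, 46, 37, 70, 27, 49))

idx <- sample(nrow(df))      # one random permutation of the row numbers
df_part1 <- df[idx[1:2], ]   # slice it into three disjoint parts
df_part2 <- df[idx[3:4], ]
df_part3 <- df[idx[5:6], ]

# The literal answer to "delete these rows": negative indexing
x <- sample(nrow(df), 3)
df_rest <- df[-x, ]          # df without the rows indexed by x
```

Because idx is a single permutation, the three parts can never share a row.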

Related

How do I apply weights to my data frame in R?

What I want is to apply weights to the observations in my data frame; I already have a column with the weights that I want to apply to my data.
This is how my data frame looks:
weight count
3      67
7      355
8      25
7      2
Basically, I want to weight each value of my count column by the respective value in my weight column. For example, the value 67 in count should be weighted by 3, the value 355 should be weighted by 7, and so on.
I tried this code from the questionr package:
wtd.table(data1$count, weights = data1$weight)
But this code aggregated my data, reducing my 1447 rows to just 172 entries. What I want is to keep my exact number of rows.
The output I want would be something like this; I just want to add another column to my data frame with the weighted values.
Count  Count applying weights
67     ####
355    ###
I am still not sure how to apply weights to the count data in the way you want.
I just want to show that you can create a new column based on the previous column in a convenient way by using dplyr. For example:
mydf
# weight count
# 1 3 67
# 2 7 355
# 3 8 25
# 4 7 2
library(dplyr)
mydf %>% mutate(weightedCount = weight * count,
                percentRank = percent_rank(weightedCount),
                cumDist = cume_dist(weightedCount))
# weight count weightedCount percentRank cumDist
# 1 3 67 201 0.6666667 0.75
# 2 7 355 2485 1.0000000 1.00
# 3 8 25 200 0.3333333 0.50
# 4 7 2 14 0.0000000 0.25
Here, weightedCount is the product of weight and count, percentRank gives the percentile rank of each value of weightedCount, and cumDist gives its cumulative distribution.
This is just an example; you can create other columns and apply other functions in a similar way.
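For comparison, the same columns can be added in base R without dplyr; percent_rank and cume_dist are approximated here with rank(), which is exact when the values are distinct, as in this example:

```r
# mydf reconstructed from the printed output above
mydf <- data.frame(weight = c(3, 7, 8, 7), count = c(67, 355, 25, 2))

mydf$weightedCount <- mydf$weight * mydf$count
# base-R stand-ins for dplyr's ranking helpers (exact for distinct values)
mydf$percentRank <- (rank(mydf$weightedCount) - 1) / (nrow(mydf) - 1)
mydf$cumDist <- rank(mydf$weightedCount, ties.method = "max") / nrow(mydf)
```

This reproduces the same weightedCount, percentRank, and cumDist values as the dplyr version shown above.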

Referencing different columns as ranges between two data frames

I have one data frame/list that gives an ID and a number:
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of items contained in each, e.g. bricks in a wall, so a list of walls with the number of bricks in each.
I have a second that contains a full list of the objects referred to in the list above, with an attribute for each:
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average in the second list; as an end result I would like a list of walls with the average brick weight per wall,
i.e. average the attributes of rows 1-25, 26-62, ..., 89-101.
My idea so far was to create a data frame with two columns:
1. 1 25
2. 26 62
3. n
4. n
5. 89 101
and then attempt to create a third column that uses the first two as x and y in a mean(table2$column1[x:y]) type formula, but I can't get anything to work.
The end result would probably look something like this:
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?
You could do something like this... set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
df1 <- data.frame(id=c(1,2,3,4,5),no=c(25,36,10,18,12))
df2 <- data.frame(obj=1:100,att=sample(1:10,100,replace=TRUE))
df1$low <- cumsum(c(1,df1$no[-nrow(df1)]))
df1$high <- pmin(cumsum(df1$no),nrow(df2))
df1$meanatt <- mapply(function(l,h) mean(df2$att[l:h]), df1$low, df1$high)
df1
id no low high meanatt
1 1 25 1 25 4.760000
2 2 36 26 61 5.527778
3 3 10 62 71 5.800000
4 4 18 72 89 5.111111
5 5 12 90 100 4.454545
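A more compact alternative sketch along the same lines: build one group label per object with rep(), then average within each group with tapply(). df1 and df2 are reconstructed here (with a seed) so the example is self-contained:

```r
set.seed(1)
df1 <- data.frame(id = 1:5, no = c(25, 36, 10, 18, 12))
df2 <- data.frame(obj = 1:100, att = sample(1:10, 100, replace = TRUE))

# one wall id per brick: id 1 repeated 25 times, id 2 repeated 36 times, ...
# truncated to the rows actually present in df2 (sum(df1$no) exceeds 100)
grp <- rep(df1$id, df1$no)[seq_len(nrow(df2))]

df1$meanatt <- as.numeric(tapply(df2$att, grp, mean))
```

This gives the same per-wall means as the mapply approach, without computing the low/high bounds explicitly.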

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes the same value for an individual across their different rows (situations).
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sampled data set as a data frame.
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for which pid=1, and the next six rows, pid=2, are different observations for the second person.
This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but it will deal with duplicated rows. replicate has its obvious use of performing an operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call with rbind and lapply deals with the duplicates that are not handled well by the code above. Instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
function(x) d[d$pid==x,])),
simplify=FALSE)
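For reference, here is a small reproducible sketch of the second snippet using the 12-row example data from the question; since that example has only two persons, it samples 2 pids with replacement instead of 3:

```r
# The 12-row example data: 2 persons, 6 rows each
d <- data.frame(pid = rep(1:2, each = 6),
                x = seq(10, 32, by = 2),
                y = 2:13,
                z = seq(-5, 0.5, by = 0.5))

set.seed(42)
boot <- replicate(4,
                  do.call("rbind",
                          lapply(sample(unique(d$pid), 2, replace = TRUE),
                                 function(p) d[d$pid == p, ])),
                  simplify = FALSE)

length(boot)      # 4 bootstrap data frames
nrow(boot[[1]])   # 12 rows each: 2 sampled persons x 6 rows per person
```

When the same pid is drawn twice, its 6 rows appear twice in that bootstrap sample, which is the behavior a panel (cluster) bootstrap needs.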

Merging Two Datasets, Removing recurring columns, Adding a new column

I have two datasets :
Here's the 1st, and Here's the 2nd.
My aim is to merge these data sets, keeping only one copy of the recurring "JN" column, and find the ratio of "Freq" between them.
For each row, I want to use this calculation:
=(100)-(100*(FreqBL/FreqB))
and place this new calculation in a 4th column.
The new data should look like:
JN FreqBL FreqB Success Ratio
4 10 33 69.6969
But I don't know how to select all rows separately, or the necessary code for the calculation.
Thanks
You want to merge the data sets. Next time, I recommend providing a small reproducible example.
> new.dt <- merge(dt1, dt2)
> new.dt$"Success ratio" <- with(new.dt, 100-(100 * (FreqBL/FreqB)))
> head(new.dt)
JN FreqB FreqBL Success ratio
1 4 33 10 69.69697
2 8 49 10 79.59184
3 10 44 13 70.45455
4 11 38 7 81.57895
5 13 29 3 89.65517
6 17 15 10 33.33333
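Since the two data sets were only linked as images, here is a hedged reconstruction of their first three rows (which columns belong to dt1 versus dt2 is an assumption inferred from the merged output), showing the same merge:

```r
# Assumed shapes: dt1 holds JN + FreqB, dt2 holds JN + FreqBL
dt1 <- data.frame(JN = c(4, 8, 10), FreqB = c(33, 49, 44))
dt2 <- data.frame(JN = c(4, 8, 10), FreqBL = c(10, 10, 13))

new.dt <- merge(dt1, dt2)   # joins on the shared JN column, kept once
new.dt$"Success ratio" <- with(new.dt, 100 - (100 * (FreqBL / FreqB)))
```

merge() automatically joins on the column name the two data frames share, so the recurring JN column appears only once in the result.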

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
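If columns really do keep arriving, a tiny helper can re-apply the reordering after each addition; move_last is a hypothetical name, not an existing function:

```r
# Move one column to the end of a data frame (sketch)
move_last <- function(df, col) {
  df[, c(setdiff(names(df), col), col)]
}

z <- data.frame(x = 1:10, y = 21:30)
z$total <- z$x + z$y
z$w <- 11:20               # total is no longer last...
z <- move_last(z, "total") # ...until we move it back

names(z)   # "x" "y" "w" "total"
```

setdiff(names(df), col) keeps the remaining columns in their original order, so only the named column moves.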
