I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes the same value for an individual across that individual's different rows (situations).
Taking into consideration the panel nature of the data, I want to re-sample d (as in a bootstrap):
with replacement, and
store each re-sampled data set as a data frame.
I considered using the sample function but could not make it work. I am a new user of R and have no programming background.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid = 1, and the next six rows, with pid = 2, are the observations for the second person.
This should work for you:
z <- replicate(100,
               d[d$pid %in% sample(unique(d$pid), 2000, replace = TRUE), ],
               simplify = FALSE)
The result z will be a list of data frames that you can then work with however you like.
EDIT: this is a little wordy, but it will deal with duplicated rows. replicate performs a given operation a set number of times (4 in the example below). I then sample the unique values of pid (here, 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of lapply and do.call("rbind", ...) handles the duplicates that are not handled well by the code above: instead of generating data frames of potentially different lengths, this version builds one data frame per sampled pid and then uses do.call("rbind", ...) to stick them back together within each iteration of replicate.
z <- replicate(4,
               do.call("rbind", lapply(sample(unique(d$pid), 3, replace = TRUE),
                                       function(x) d[d$pid == x, ])),
               simplify = FALSE)
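As a usage example (not part of the original answer), once z is built you could compute a bootstrap statistic from each resampled data frame, say the mean of the x column shown above, and summarize it:

boot_means <- sapply(z, function(b) mean(b$x))   # one statistic per resample
quantile(boot_means, c(0.025, 0.975))            # rough percentile confidence interval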
I have two datasets, one longitudinal (following individuals over multiple years) and one cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but it uses a randomly generated ID variable, which does not allow tracking someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID and year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked), I cannot match on them.
I guess I would need to find a set of variables that uniquely identifies each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables, so I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset, and then match on these to the cross-sectional dataset.
Thanks in advance!
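A minimal sketch of the matching step, assuming you have already settled on a set of shared key variables (here year and y, purely for illustration) and that long and cross are the two data frames:

keys <- c("year", "y")
any(duplicated(long[, keys]))    # should be FALSE if the keys uniquely identify rows
any(duplicated(cross[, keys]))   # should be FALSE here too
merged <- merge(long, cross[, c(keys, "x")], by = keys)   # attach the extra variables to the panel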
What I want is to apply weights to the observations in my data frame; I already have an entire column with the weights that I want to apply to my data.
This is how my data frame looks:
weight count
     3    67
     7   355
     8    25
     7     2
Basically, I want to weight each value in my count column by the corresponding value in my weight column. For example, the value 67 in count should be weighted by 3, the value 355 in count should be weighted by 7, and so on.
I tried to use this code from the questionr package:
wtd.table(data1$count, weights = data1$weight)
But this code altered my data frame and ended up reducing my 1447 rows to just 172 entries. What I want is to keep my exact number of rows.
The output I want would look something like this; I just want to add another column to my data frame with the weighted values:
Count   Count applying weights
   67   ####
  355   ###
I am still not sure how to apply weights to the count data in exactly the way you want.
I just want to show that you can conveniently create a new column based on an existing column by using dplyr. For example:
mydf
# weight count
# 1 3 67
# 2 7 355
# 3 8 25
# 4 7 2
library(dplyr)
mydf %>% mutate(weightedCount = weight * count,
                percentRank = percent_rank(weightedCount),
                cumDist = cume_dist(weightedCount))
# weight count weightedCount percentRank cumDist
# 1 3 67 201 0.6666667 0.75
# 2 7 355 2485 1.0000000 1.00
# 3 8 25 200 0.3333333 0.50
# 4 7 2 14 0.0000000 0.25
Here, weightedCount is the product of weight and count, percentRank gives the percentile rank of each value of weightedCount, and cumDist gives the cumulative distribution of weightedCount.
This is just an example; you can create other columns and apply other functions in a similar way.
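For what it's worth, a base-R sketch of the same new column, plus a weighted mean of count if a single summary number is what you are after:

mydf$weightedCount <- mydf$weight * mydf$count   # same column as mutate() above
weighted.mean(mydf$count, w = mydf$weight)       # one overall weighted summary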
I have one data frame/list that gives an ID and a number:
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of items contained in each, e.g. bricks in a wall, so a list of walls with the number of bricks in each.
I have a second list that contains the full set of objects referred to in the list above, with a second attribute for each:
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average over in the second list; as an end result I would like a list of walls with the average weight of a brick in each wall,
i.e. averaging the attributes of items 1-25, 26-62, ..., 89-101.
My idea so far was to create a data frame with two columns:
1. 1 25
2. 26 62
3. n
4. n
5. 89 101
and then attempt to create a third column that uses the first two as x and y in a mean(table2$column1[x:y]) type formula, but I can't get anything to work.
The end result would probably look something like this:
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?
You could do something like this... set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
df1 <- data.frame(id = c(1, 2, 3, 4, 5), no = c(25, 36, 10, 18, 12))
df2 <- data.frame(obj = 1:100, att = sample(1:10, 100, replace = TRUE))
df1$low  <- cumsum(c(1, df1$no[-nrow(df1)]))   # first row of each range
df1$high <- pmin(cumsum(df1$no), nrow(df2))    # last row of each range, capped at nrow(df2)
df1$meanatt <- mapply(function(l, h) mean(df2$att[l:h]), df1$low, df1$high)
df1
id no low high meanatt
1 1 25 1 25 4.760000
2 2 36 26 61 5.527778
3 3 10 62 71 5.800000
4 4 18 72 89 5.111111
5 5 12 90 100 4.454545
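An alternative sketch of the same computation, building a group index with rep() instead of explicit low/high limits (using the same df1 and df2 as above):

grp <- rep(df1$id, times = df1$no)[seq_len(nrow(df2))]   # wall id for each brick, truncated to nrow(df2)
df1$meanatt2 <- as.numeric(tapply(df2$att, grp, mean))   # mean attribute per wall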
I am having a hard time figuring out how to program this in R. Given a number of X and Y pairs, such as
X Y
9 1
1 2
12 3
8 4
9 4
4 5
16 6
18 7
5 8
11 9
4 10
6 11
6 12
14 13
18 13
20 13
13 14
20 15
20 16
I need to randomly sample n pairs that fulfil the condition that the Xs and Ys are unique. For instance, with n = 3 and using the data above, the combinations (9,1) (4,5) (4,10) or (1,2) (14,13) (20,13) would be invalid because X = 4 or Y = 13 is duplicated within each of those solutions. However, (9,1) (1,2) and (8,4) would be a valid solution because the Xs and Ys are unique. Any help will be most welcome.
If you start by sampling (randomizing) the rows of your original data, then subset only those rows where neither X nor Y is duplicated, and then select the first, last, or any n (= 3) rows (you could use sample again), you should be fine, I think.
set.seed(1) # for reproducibility
head(subset(df[sample(nrow(df)),], !duplicated(X) & !duplicated(Y)), 3)
# X Y
#6 4 5
#7 16 6
#10 11 9
In response to the comment by @Richo64 saying that this approach will not randomly select the pairs:
It does sample the pairs randomly, because the first (innermost) thing I do is
df[sample(nrow(df)),]
which shuffles the rows of the data randomly. Once that is done, it is a matter of chance which of the rows with, say, a 4 in column X comes first and therefore remains in the data, because the other 4 is removed as a duplicated entry in X.
The same applies to values in Y.
It is obvious, then, that after the sampling and subsetting you are free to choose any 3 rows of the remaining data, and even if you always selected the first 3 rows you would still get a random selection that differs every time you run it (except when it coincidentally samples the same rows again).
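To make the "sample again at the end" variant explicit, here is a small sketch wrapping the same idea in a function (the function name is my own):

sample_unique_pairs <- function(df, n) {
  shuffled <- df[sample(nrow(df)), ]                                      # randomize row order
  kept <- shuffled[!duplicated(shuffled$X) & !duplicated(shuffled$Y), ]   # drop repeated X or Y values
  kept[sample(nrow(kept), n), ]                                           # pick n of the surviving rows
}
sample_unique_pairs(df, 3)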
I am working with a large dataset and I am trying to first identify clusters of values that meet specific threshold values. My aim is then to keep only clusters of a minimum length. Below are some example data and my progress thus far:
Test <- c(rep("A", 10), rep("B", 10))
Sequence <- rep(1:10, times = 2)
Value <- c(3, 2, 3, 4, 3, 4, 4, 5, 5, 2, 2, 4, 5, 6, 4, 4, 6, 2, 3, 2)
Data <- data.frame(Test, Sequence, Value)
Using the evd package, I have identified clusters of values > 3:
library(evd)
C1 <- clusters(Data$Value, u = 3, r = 1, cmax = FALSE, plot = TRUE)
Which produces
C1
$cluster1
4
4
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
My problem is twofold:
1) I don't know how to relate this back to the original data frame (for example, to Test A and B).
2) How can I keep only clusters with a minimum size of 3 (thus excluding cluster 1)?
I have looked into various filtering options, but they do not cluster the data according to a desired threshold, nor do they offer an option for a minimum cluster size.
Any help is much appreciated.
Q1: relating back to the original data frame: have a look at Carl Witthoft's answer. He wrote a variant of rle() (seqle(), which looks for runs of consecutive integers rather than repetitions): detect intervals of the consequent integer sequences
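A possible sketch for pulling the clustered rows back out of Data, relying on the fact that the element names in each cluster are row positions of Data (as in the output shown above):

lapply(C1, function(cl) Data[as.numeric(names(cl)), ])   # original Test/Sequence/Value rows per cluster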
Q2: keeping only clusters of a certain length (a minimum size of 3 means length >= 3):
C1[sapply(C1, length) >= 3]
yields the 2 clusters that are long enough:
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6