'Forward' cumulative sum in dplyr - r

When examining datasets from longitudinal studies, I commonly get results like this from a dplyr analysis chain from the raw data:
df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))
i.e. a count of how many participants have completed a certain number of assessments at this point in time.
Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As the table below shows, a standard cumulative sum isn't appropriate. What we want are the values in the n_total column, which is a sort of "forwards cumulative sum" of the values in the n_people column, i.e. the value in each row should be the sum of its own value and all values beyond it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:
n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205
Generating the cumulative sum is simple:
mutate(df, cumsum = cumsum(n_people))
What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a dplyr analysis chain? I'm guessing that cumsum would need to be applied to n_people after sorting by n_sessions descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.

You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:
mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))
For example, on your data this returns:
n_sessions n_people rev_cumsum
1 1 59 205
2 2 89 146
3 3 30 57
4 4 23 27
5 5 4 4
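As a cross-check, here is a minimal base R sketch of the same idea. It assumes the rows are already ordered by n_sessions, and the names via_rev and via_total are only illustrative. The second form uses the fact that the "forward" sum at each row equals the grand total minus the ordinary cumulative sum, plus the row's own value:

```r
df <- data.frame(n_sessions = c(1, 2, 3, 4, 5), n_people = c(59, 89, 30, 23, 4))
x <- df$n_people

# Reverse, accumulate, reverse back
via_rev <- rev(cumsum(rev(x)))

# Same result without reversing: total minus everything strictly before the row
via_total <- sum(x) - cumsum(x) + x

via_rev
# 205 146  57  27   4
```

Either expression can be dropped into mutate() in place of the other.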

Related

Matching two datasets using different IDs

I have two datasets: one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable, which makes it impossible to track someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID and year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked), I cannot match on it.
I guess I would need to find a set of variables that uniquely identify each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!
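One possible starting point, sketched in base R on the sample data above: test whether a candidate set of columns has any repeated combination before merging on it. The helper is_key is hypothetical (not from any package), and the candidate key c("year", "y") is just a guess for illustration.

```r
long <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                   year = c(1, 2, 3, 1, 2, 3),
                   y = c(10, 20, 30, 15, 20, 5))
cross <- data.frame(id = c(912, 492, 363, 789, 134, 267),
                    year = c(1, 2, 3, 1, 2, 3),
                    y = c(10, 20, 30, 15, 25, 5),
                    x = c(1, 1, 0, 1, 0, 0))

# Hypothetical helper: TRUE when the named columns contain no repeated combination
is_key <- function(data, cols) anyDuplicated(data[, cols, drop = FALSE]) == 0

key <- c("year", "y")
key_ok_long <- is_key(long, key)    # FALSE: year = 2, y = 20 occurs for id 1 and id 2
key_ok_cross <- is_key(cross, key)  # TRUE

# Only merge once the candidate key is unique in both datasets
if (key_ok_long && key_ok_cross) {
  matched <- merge(long, cross[, setdiff(names(cross), "id")], by = key)
}
```

On the sample as posted, year and y alone are not enough for the long dataset, so more columns would have to be added to the candidate key. With 200-300 variables, the same check could be run over progressively larger column sets.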

How do I apply weights to my data frame in R?

I want to apply weights to the observations in my data frame. I already have a column with the weights that I want to apply.
This is what my data frame looks like.
weight count
3 67
7 355
8 25
7 2
Basically, I want to weight each value of my count column by the corresponding value in my weight column. For example, the count value 67 should be weighted by 3, the count value 355 should be weighted by 7, and so on.
I tried to use this code from the questionr package.
wtd.table(data1$count, weights = data1$weight)
But this code altered my data frame and ended up reducing my 1447 rows to just 172 entries. I want to keep exactly the same number of entries.
The output that I want would be something like this; I just want to add another column with the weighted values to my data frame.
Count | Count applying weights
67 | ####
355 | ###
I am still not sure how to apply weights to the count data in the way you want.
I just want to show that you can create a new column based on the previous column in a convenient way by using dplyr. For example:
library(dplyr)
mydf
# weight count
# 1 3 67
# 2 7 355
# 3 8 25
# 4 7 2
mydf %>% mutate(weightedCount = weight*count,
percentRank = percent_rank(weightedCount),
cumDist = cume_dist(weightedCount))
# weight count weightedCount percentRank cumDist
# 1 3 67 201 0.6666667 0.75
# 2 7 355 2485 1.0000000 1.00
# 3 8 25 200 0.3333333 0.50
# 4 7 2 14 0.0000000 0.25
Here, weightedCount is the product of weight and count, percentRank is the percentile rank of each value in weightedCount, and cumDist is the cumulative distribution of the values in weightedCount.
This is only an example. You can create other columns and apply other functions in a similar way.
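If "applying weights" simply means multiplying each count by its weight (an assumption, since the question leaves this open), the same column can be added in base R without dplyr. The name weighted_count is illustrative, and data1 here holds only the four sample rows:

```r
data1 <- data.frame(weight = c(3, 7, 8, 7), count = c(67, 355, 25, 2))

# Element-wise product: one weighted value per row, so the row count is unchanged
data1$weighted_count <- data1$weight * data1$count

nrow(data1)
# 4
```

The drop from 1447 rows to 172 entries is likely because a weighted table has one entry per distinct count value rather than one per row.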

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that showcase results of an analysis from one month and then the subsequent month.
Here is a smaller version of the data:
Jan19=data.frame(Group=c(589,630,523,581,689),Count=c(191,84,77,73,57))
Dec18=data.frame(Group=c(589,630,523,478,602),Count=c(100,90,50,6,0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month. Dec18 has counts >=0 for each group. I have been referencing the full Dec18 dataset for counts =0 and manually entering them into the full Jan19 dataset. I want to rid myself of the manual part of this exercise and just be able to append the groups with counts = 0 to the end of the Jan19 dataset.
That led me to the following code to perform what I described above:
Gdata=rbind(Jan19,Dec18)
Gdata=Gdata[!duplicated(Gdata$Group),]
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results>0 as the duplicate and removes that. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. So, that led me to the following line of code, where I wanted to set a condition: if the newly appended data (Dec18) has a duplicate Group in the newer data (Jan19), then the corresponding Count should be 0. Otherwise, the value of Count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one. You just need the Dec18 data you rbind on to have 0 in its Count column:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
You wrote: "While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results>0 as the duplicate and removes that."
This is incorrect. !duplicated() will keep the first occurrence and remove later occurrences. None of the Jan19 data is removed---we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 were not all 0 counts. We fix this with the transform().
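A quick way to see this behavior of duplicated() on a small made-up vector (the values are only illustrative):

```r
groups <- c(589, 630, 523, 589, 630)

# Only the second and later occurrences are flagged as duplicates
duplicated(groups)
# FALSE FALSE FALSE  TRUE  TRUE

# So negating the flags keeps the first occurrence of each value
keep <- groups[!duplicated(groups)]
keep
# 589 630 523
```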
There are plenty of other ways to do this: with a join using the merge function, or by rbinding only the non-duplicated groups, as d.b suggests, with rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group,]), among others. We could also make your ifelse approach work like this:
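Gdata = rbind(Jan19, Dec18)
# test every row of the combined data, not just the Dec18 rows
Gdata$Count = ifelse(!Gdata$Group %in% Jan19$Group, 0, Gdata$Count)
Gdata = Gdata[!duplicated(Gdata$Group), ]
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
Gdata = Gdata[!duplicated(Gdata$Group), ]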
Use whatever makes the most sense to you.
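For completeness, here is a rough sketch of the merge-based variant mentioned above. It assumes a full outer join on Group is acceptable, and the name joined is illustrative. Note that merge sorts the result by Group, so the row order differs from the appended version:

```r
Jan19 <- data.frame(Group = c(589, 630, 523, 581, 689), Count = c(191, 84, 77, 73, 57))
Dec18 <- data.frame(Group = c(589, 630, 523, 478, 602), Count = c(100, 90, 50, 6, 0))

# Join Jan19 to the Dec18 group list only; groups missing from Jan19 get NA counts
joined <- merge(Jan19, Dec18["Group"], by = "Group", all = TRUE)

# Groups that appear only in Dec18 are the ones with a count of 0 in Jan19
joined$Count[is.na(joined$Count)] <- 0
joined
```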

Referencing different columns as ranges between two data frames

I have one data frame/list that gives an ID and a number
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of objects contained in each, e.g. bricks in a wall, so a list of walls with the number of bricks in each.
I have a second list that contains a full list of the objects referred to in the list above, with a second attribute for each.
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average over in the second list; as an end result, I would like a list of walls with the average brick weight per wall,
i.e. average the attributes of 1-25, 26-61 ... 90-101.
My idea so far was to create a data frame with two columns
1. 1 25
2. 26 61
3. n
4. n
5. 90 101
and then attempt to create a third column that uses the first two as x and y in a mean(table2$coloumn1[x:y]) type formula, but I can't get anything to work.
The end result could probably look something like this
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?
You could do something like this... set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
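df1 <- data.frame(id=c(1,2,3,4,5),no=c(25,36,10,18,12))
# example attribute data; the values are random, so the means will differ between runs
df2 <- data.frame(obj=1:100,att=sample(1:10,100,replace=TRUE))
# first row of each range: 1, then one past the end of the previous range
df1$low <- cumsum(c(1,df1$no[-nrow(df1)]))
# last row of each range, capped at the number of rows in df2
df1$high <- pmin(cumsum(df1$no),nrow(df2))
df1$meanatt <- mapply(function(l,h) mean(df2$att[l:h]), df1$low, df1$high)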
df1
id no low high meanatt
1 1 25 1 25 4.760000
2 2 36 26 61 5.527778
3 3 10 62 71 5.800000
4 4 18 72 89 5.111111
5 5 12 90 100 4.454545
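An alternative sketch that avoids computing the range limits: label each row of the second data frame with the id it belongs to, then average by label. This assumes the second data frame has exactly sum(df1$no) rows, in wall order. The name wall and the attribute values are made up for illustration:

```r
df1 <- data.frame(id = 1:5, no = c(25, 36, 10, 18, 12))
# Assumed example data: 101 bricks (the sum of df1$no) with made-up weights
df2 <- data.frame(obj = 1:101, att = rep(1:10, length.out = 101))

# Label every row of df2 with the id of the wall it belongs to
wall <- rep(df1$id, times = df1$no)

# Mean attribute per wall, returned in the order of df1$id
df1$meanatt <- as.vector(tapply(df2$att, wall, mean))
df1
```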

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual, i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value across the different rows (situations) of an individual.
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for which pid=1, and the next six rows, with pid=2, are different observations for the second person.
This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but will deal with duplicated rows. replicate has its obvious use of performing a set operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of a do.call to rbind and lapply deals with the duplicates that are not handled well by the above code. Thus, instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
function(x) d[d$pid==x,])),
simplify=FALSE)
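Here is a self-contained sketch of the same approach on the two-person sample from the question. The seed is arbitrary, and the names n_pid, resamples and drawn are illustrative. It draws as many pids as there are individuals, so every resample has the same number of rows as d when each person has the same number of rows:

```r
d <- data.frame(pid = rep(1:2, each = 6),
                x = seq(10, 32, by = 2),
                y = 2:13,
                z = seq(-5, 0.5, by = 0.5))

set.seed(1)  # arbitrary seed, only so the draws can be repeated
n_pid <- length(unique(d$pid))

resamples <- replicate(4, {
  # Draw individuals (not rows) with replacement
  drawn <- sample(unique(d$pid), n_pid, replace = TRUE)
  # Stack the full block of rows for each drawn individual, repeats included
  do.call("rbind", lapply(drawn, function(p) d[d$pid == p, ]))
}, simplify = FALSE)

sapply(resamples, nrow)
# 12 12 12 12
```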
