Stata: counting for this period and next period - count

I have a data set similar to the one below and want to create a count variable under these conditions: if mig is 1, count how many of un1, un2, and un3 are 1 in this period, and also how many of un1, un2, and un3 are 1 in the next period. In other words, I want the count of un* in this period and the next, for each individual.
I use the code
egen ... anycount(un1-un3) if mig ==1 & (un1|un2|un3||f.un1|f.un2 |f.un3)
but I cannot get a count of future values.
Id t mig un1 un2 un3 count
1 1 0 0 1 1
1 2 0 0 0 1
1 3 1 0 0 1 4
1 4 0 1 1 1
1 5 0 0 0 0
2 1 0 0 1 0
2 2 1 0 0 0
2 3 0 1 0 0 1
2 4 0 0 0 1
2 5 0 0 0 0

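This assumes you have panel data and have already declared it with tsset Id t (or, similarly, xtset Id t), so that the lead operator F. works within each Id.
How about
gen count = cond(mig == 1, un1 + un2 + un3 + F.un1 + F.un2 + F.un3, 0)
Note that F.un1, F.un2, and F.un3 are missing in the last period of each panel, so count will be missing there whenever mig is 1.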

Related

Summarizing/counting multiple binary variables

For the purpose of this question, my data set includes 16 columns (c1_d, c2_d, ..., c16_d) and 364 rows (1-364). This is what it briefly looks like:
c1_d c2_d c3_d c4_d c5_d c6_d c7_d c8_d c9_d c10_d c11_d c12_d c13_d c14_d c15_d c16_d
1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0
4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
5 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0
Note that row 1, for example, has five 1s and eleven 0s.
This is what I'm trying to do: count how many rows contain each number of 1s (i.e., by the end of this analysis I want something like: 20 rows had zero 1s, 33 rows had one 1, 100 rows had ten 1s, etc.).
I created a data frame containing all the rows (364) and columns (16) I need. I tried print.data.frame, whose output is shown above, but it doesn't give me the number of 0s and 1s per row. I also tried functions such as table, ftable, and xtabs, but they don't really work for more than three variables.
I would highly appreciate your help on this.
If I understand correctly:
library(dplyr)
library(tidyr)
df %>%
  transmute(count0 = rowSums(df == 0),
            count1 = rowSums(df == 1)) %>%
  pivot_longer(everything()) %>%
  count(name, value)
name value n
<chr> <dbl> <int>
1 count0 5 1
2 count0 6 1
3 count0 7 1
4 count0 11 1
5 count0 12 1
6 count1 4 1
7 count1 5 1
8 count1 9 1
9 count1 10 1
10 count1 11 1
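If only the distribution of 1s per row is needed, a shorter base R sketch may be enough. It assumes the data are entirely 0/1; the matrix m below just re-enters the five sample rows shown in the question and is not the full 364-row data set.

```r
# Five sample rows from the question, entered by hand (16 columns each)
m <- matrix(c(
  1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,
  1,1,0,1,1,1,0,1,1,1,1,0,1,0,0,0,
  1,1,0,1,1,1,1,1,0,1,1,0,1,0,1,0,
  0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,
  1,0,1,1,1,1,0,1,0,1,1,0,0,0,1,0),
  nrow = 5, byrow = TRUE)

ones_per_row <- rowSums(m == 1)   # number of 1s in each row
tab <- table(ones_per_row)        # how many rows have each count of 1s
tab
```

With the real data frame, the same idea would be table(rowSums(df == 1)).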

How to create multiple new columns based on groups of columns that start with a certain prefix and also contain a certain string?

I have data that look like this
df <- data.frame(ID = c(1,2,3,4,5,6),
                 var1_unmod = c(1,0,0,1,0,1),
                 var1_me1 = c(0,1,0,0,0,0),
                 var1_me2 = c(1,1,1,0,1,0),
                 var1_me3 = c(0,0,1,0,0,0),
                 var1_ac1 = c(1,0,1,1,0,1),
                 var2_unmod = c(1,0,1,1,0,0),
                 var2_me1 = c(0,0,0,0,1,0),
                 var2_me2 = c(1,1,0,1,1,1),
                 var2_ac1 = c(1,1,0,1,0,0),
                 var2_me1ac1 = c(1,0,0,0,0,0),
                 var2_me2ac1 = c(1,0,0,1,1,1))
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1
1 1 1 0 1 0 1 1 0 1 1 1 1
2 2 0 1 1 0 0 0 0 1 1 0 0
3 3 0 0 1 1 1 1 0 0 0 0 0
4 4 1 0 0 0 1 1 0 1 1 0 1
5 5 0 0 1 0 0 0 1 1 0 0 1
6 6 1 0 0 0 1 0 0 1 0 0 1
except that in the actual dataset, the prefixes aren't sequential like var1 and var2, they are basically random combinations of letters and numbers, and there are about 30 different ones.
For each of these prefixes (var1, var2, ...), I need to create a single variable that indicates whether any of the columns with that prefix that also contain me1, me2, or me3 (so for var2 this would be var2_me1, var2_me2, var2_me1ac1, var2_me2ac1) are nonzero. The output dataset would have additional columns like this:
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var1_meX var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1 var2_meX
1 1 1 0 1 0 1 1 1 0 1 1 1 1 1
2 2 0 1 1 0 0 1 0 0 1 1 0 0 1
3 3 0 0 1 1 1 1 1 0 0 0 0 0 0
4 4 1 0 0 0 1 0 1 0 1 1 0 1 1
5 5 0 0 1 0 0 1 0 1 1 0 0 1 1
6 6 1 0 0 0 1 0 0 0 1 0 0 1 1
First I need to identify the applicable columns for each prefix (because there is no pattern to the prefixes, I'm thinking I will have to hard code at least this part), and then maybe somehow write a loop that iterates through the columns (stored in a vector?) for each prefix. I tend to have trouble referencing varying column names within loops. Any help is appreciated!
Here is a basic approach:
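cols <- colnames(df)
varnames <- c("var1", "var2")
df2 <- df
for (i in varnames) {
  newname <- paste(i, "meX", sep = "_")
  # index the original df so the logical mask built from cols always matches its columns
  df2[, newname] <- apply(df[, grepl(i, cols) & grepl("me", cols)], 1, sum)
  # collapse the row sum to a 0/1 indicator
  df2[, newname] <- ifelse(df2[, newname] >= 1, 1, 0)
}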
This will probably need to be modified based on the specific details of your data.
Define the unique column prefixes in cols, then use lapply to iterate over each prefix and return 1 if at least one of its '_me' columns is 1 in that row.
all_cols <- names(df)
cols <- c('var1', 'var2')
df[paste0(cols, '_meX')] <- lapply(cols, function(x)
  as.integer(rowSums(df[grep(paste0(x, '_me'), all_cols, value = TRUE)]) > 0))
The new columns look like:
df[13:14]
# var1_meX var2_meX
#1 1 1
#2 1 1
#3 1 0
#4 0 1
#5 1 1
#6 0 1
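Since the question mentions about 30 irregular prefixes, it may be possible to derive them from the column names instead of hard-coding them. The sketch below assumes that the prefix is everything before the first underscore and that the columns of interest contain "_me1", "_me2", or "_me3"; both are assumptions about the real data, and the names prefixes and me_cols are only illustrative.

```r
df <- data.frame(ID = 1:6,
                 var1_unmod = c(1,0,0,1,0,1), var1_me1 = c(0,1,0,0,0,0),
                 var1_me2 = c(1,1,1,0,1,0), var1_me3 = c(0,0,1,0,0,0),
                 var1_ac1 = c(1,0,1,1,0,1), var2_unmod = c(1,0,1,1,0,0),
                 var2_me1 = c(0,0,0,0,1,0), var2_me2 = c(1,1,0,1,1,1),
                 var2_ac1 = c(1,1,0,1,0,0), var2_me1ac1 = c(1,0,0,0,0,0),
                 var2_me2ac1 = c(1,0,0,1,1,1))

all_cols <- names(df)
# Assumption: the prefix is the part of the name before the first underscore
prefixes <- unique(sub("_.*$", "", grep("_", all_cols, value = TRUE)))

for (p in prefixes) {
  # columns with this exact prefix that also contain me1, me2, or me3
  me_cols <- all_cols[startsWith(all_cols, paste0(p, "_")) &
                        grepl("_me[123]", all_cols)]
  # 1 if any of those columns is nonzero in the row, else 0
  df[[paste0(p, "_meX")]] <- as.integer(rowSums(df[me_cols]) > 0)
}

df[c("var1_meX", "var2_meX")]
```

Matching on paste0(p, "_") rather than on p alone keeps a prefix such as var1 from also picking up columns of a hypothetical var10.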

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each new variable is either 1 or 0:
jb$Finances <- ifelse(grepl("Finances", jb$content.roll), 1, 0)
I want to add up the number of 1s in each row across a selection of the columns I just made (using the code above) and store the result in another column called "sum.content". I used the code below:
jb <- jb %>%
  mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for every row. I was expecting something no larger than 9, since I only selected 9 variables. I don't want to delete the other variables, like V1 and V2; I just want to sum some of the variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add up the number of 1s in each row (within the columns I select). How would I go about doing that for a chosen set of columns?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums)
...and the output, where the sums should equal the sum of V1 - V5 but not
X1 - X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
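As for why the original attempt returned the same number in every row: sum() adds up every value it is given and returns a single number, which mutate() then recycles down the column. rowSums() works row by row instead. A minimal base R sketch, using made-up toy values rather than the asker's data (only the column names come from the question):

```r
# Toy data: V1 is left out of the sum on purpose
jb <- data.frame(V1            = c(1, 2, 2),
                 Finances      = c(1, 0, 0),
                 Exercise      = c(1, 1, 0),
                 Volunteer     = c(1, 0, 0),
                 Relationships = c(1, 0, 0),
                 Laugh         = c(0, 1, 1))

content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# sum() collapses all selected columns into one grand total
grand_total <- sum(jb$Finances, jb$Exercise, jb$Volunteer,
                   jb$Relationships, jb$Laugh)

# rowSums() gives one total per row, restricted to the chosen columns
jb$sum.content <- rowSums(jb[content_cols])
jb
```

Here grand_total is the single value that the mutate(sum(...)) call would have written into every row, while sum.content differs by row.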

counting the occurrences of a number and when it occurred in R data.frame and data.table

I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data <- data.frame('number' = 1:11,
                   'col1' = sample(10:20), 'col2' = sample(10:20),
                   'col3' = sample(10:20), 'col4' = sample(10:20),
                   'col5' = sample(10:20),
                   'date' = c('12-12-2014', '12-11-2014', '12-10-2014', '12-09-2014',
                              '12-08-2014', '12-07-2014', '12-06-2014', '12-05-2014',
                              '12-04-2014', '12-04-2014', '12-03-2014'))
The number column is an 'id' column and the last column is a date.
I want to count the number of times each number occurs across columns 2:6 as a whole (not per column), and when each occurrence happened.
I am stuck on the first part having tried the following using data.table:
count <- function(){
  i = 1
  DT <-data.table(data[2:6])
  for (i in 10:20){
    DT[, .N, by =i]
    i = i + 1
  }
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please? I would also appreciate help with the second part, which I have not even attempted yet, i.e. associating a date or a row number with each occurrence of a number.
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or, if you need a data.table solution (from @Arun's comments):
library(data.table)
dcast.data.table(melt(setDT(data), id = "date", measure = 2:6),
                 date ~ value)
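For comparison, the same two tables can be sketched in base R without reshaping packages. The sketch assumes the value columns are columns 2:6, as in the question; the seed and the name long are only there to make the example reproducible and are not part of the original post.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
data <- data.frame(number = 1:11,
                   col1 = sample(10:20), col2 = sample(10:20),
                   col3 = sample(10:20), col4 = sample(10:20),
                   col5 = sample(10:20),
                   date = c('12-12-2014', '12-11-2014', '12-10-2014', '12-09-2014',
                            '12-08-2014', '12-07-2014', '12-06-2014', '12-05-2014',
                            '12-04-2014', '12-04-2014', '12-03-2014'))

# Stack columns 2:6 into one vector; unlist() goes column by column,
# so repeating the date column once per value column keeps rows aligned
long <- data.frame(date  = rep(data$date, times = 5),
                   value = unlist(data[2:6], use.names = FALSE))

overall <- table(long$value)              # occurrences of each number overall
by_date <- table(long$date, long$value)   # occurrences of each number per date
overall
by_date
```

Because each column here is a permutation of 10:20, every number appears exactly five times in overall; real data would of course give uneven counts.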

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not have complete information, by which I mean all cases that are not repeated 8 times. Can someone help me find a solution?
Below is an example with three subjects (A, B, and C). I have 8 observations each for A and B, but only 2 for C. How can I delete the rows for C, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
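The ave() idea can also be written with a numeric vector, which avoids comparing against the string "8". This is a sketch under the same assumption that the subject identifier is in V1; the names n_per_id and kept are illustrative only.

```r
temp <- read.table(text = "
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header = FALSE)

# For every row, the number of rows that share its V1 value
n_per_id <- ave(seq_len(nrow(temp)), temp$V1, FUN = length)

# Keep only subjects observed exactly 8 times
kept <- temp[n_per_id == 8, ]
kept
```

Changing the condition to n_per_id >= 8 would keep subjects with at least 8 rows instead of exactly 8.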
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
        Filter(function(subgroup) nrow(subgroup) == 8,
               split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
