Take the following patient data example from a hospital:
YEAR <- sample(1980:1995, 15, replace = TRUE)
Pat_ID <- sample(1:100, 15)
sex <- c(1,0,1,0,1,0,0,1,0,0,0,0,1,0,0)
df1 <- data.frame(Pat_ID, YEAR, sex)
I want to introduce a dummy variable $PAIR_IDENTIFIER that takes a new value each time a new sex==1 appears. The problem is that there is no constant pattern to the sex variable: sometimes the next 1 appears two positions later, sometimes three positions later, and so on.
So $PAIR_IDENTIFIER <- c(1,1,2,2,3,3,3,4,4,4,4,4 .....)
You can do this simply with cumsum:
df1$PAIR_IDENTIFIER <- cumsum(df1$sex)
df1
# Pat_ID YEAR sex PAIR_IDENTIFIER
#1 54 1991 1 1
#2 100 1992 0 1
#3 6 1995 1 2
#4 99 1994 0 2
#5 42 1988 1 3
#6 65 1990 0 3
#7 53 1994 0 3
#8 96 1987 1 4
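This works because cumsum() increments the running total at every sex == 1, while each 0 carries the previous value forward, so every new 1 starts a new pair identifier. A minimal illustration on a toy vector:
cumsum(c(1, 0, 1, 0, 1, 0, 0, 1))
# [1] 1 1 2 2 3 3 3 4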
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. The survey follows a large group of people for 25 years and records changes in 'region' (categorical, 1-4), 'urban' (dummy), 'wage', and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period, within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: "Count number of changes in categorical variables during repeated measurements" and "Identify change in categorical data across datapoints in R", but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that indicates a 1 for each change in region, comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (when looking at the first value for a given i), it is treated as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the results you want.
Edit: if you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: afterwards, to get a grand total for the tot_region and tot_urban columns, you can use colSums (storing the earlier result as df_tot, as above):
colSums(df_tot[-1])
tot_region tot_urban
6 2
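For comparison, a hedged base-R sketch of the same computation, assuming the data is in a data frame df as in the dplyr answer: the formula interface of aggregate() drops rows with an NA in any listed variable (the same effect as drop_na above), and diff() counts value-to-value changes within each i.
# Base-R sketch; the formula interface silently drops NA rows
aggregate(cbind(region, urban) ~ i, data = df,
          FUN = function(x) sum(diff(x) != 0))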
I have two datasets, data A and data B. Data A contains 30,000 observations while data B has 10,000 observations. Both datasets cover 156 countries, identified by their ISO number.
I want to add some of the variables in data B to data A (let's say the variable Y*). However, I face problems when merging these two datasets.
Below you can see the samples of the datasets
Data A
Country ISO year X
A 1 1990 0
A 1 1991 0
A 1 1992 0
A 1 1993 0
A 1 1994 1
B 2 1990 0
B 2 1991 0
B 2 1992 0
B 2 1993 0
B 2 1994 1
Data B
Country ISO year Y*
A 1 1990 1
A 1 1994 0
B 2 1990 1
B 2 1992 0
So I am interested in getting the variable Y* into my data A. To be more precise, I want to add it by country and year.
Below is the code I use to add the Y* variable. I have used this code many times and it has always worked; I cannot figure out why it doesn't work in this case.
variables <- c("Country", "year", "Y*")
newdata <- merge(DataA, DataB[,variables], by=c("Country","Year"), all.x=TRUE)
When I run this code, I get "newdata" with the variable Y*, but with 5 times more rows than Data A.
Question: Is there any relatively simple and efficient way of doing this properly? Is there something about the structure of dataset B that creates more rows? In any case, I am grateful for any suggestions that could solve this problem.
This is the outcome I want to get:
Country ISO year X Y*
A 1 1990 0 1
A 1 1991 0 0
A 1 1992 0 0
A 1 1993 0 0
A 1 1994 1 0
B 2 1990 0 1
B 2 1991 0 0
B 2 1992 0 0
B 2 1993 0 0
B 2 1994 1 0
Use merge, and make sure to readjust the values of the Y* variable afterwards:
z <- merge(DataA, DataB, by = intersect(names(DataA), names(DataB)), all = TRUE)
Or with dplyr (a non-syntactic name like Y* must be wrapped in backticks):
library(dplyr)
left_join(DataA, DataB %>% select(Country, year, `Y*`), by = c("Country", "year"))
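As to why the original merge multiplies rows: merge returns one output row for every matching pair of rows, so duplicated Country-year keys in either dataset inflate the result (also note the original call passes "Year" where the column is named "year"). A quick hedged check, assuming the data frames as shown:
# Returns 0 when every Country-year key in DataB is unique;
# anything else signals duplicates that will multiply merged rows
anyDuplicated(DataB[, c("Country", "year")])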
I have unbalanced panel data with a binary variable indicating whether the event occurred or not. I want to control for time dependency, that is, to control for the time that has elapsed since the event last occurred.
Here is a reproducible example, with a vector of what I am trying to achieve. Thanks!
id year onset time_since_event
1 1 1989 0 0
2 1 1990 0 1
3 1 1991 1 2
4 1 1992 0 0
5 1 1993 0 1
6 1 1994 0 2
7 2 1989 0 0
8 2 1990 1 1
9 2 1991 0 0
10 2 1992 1 1
11 2 1993 0 2
12 2 1994 0 3
13 3 1991 0 0
14 3 1992 0 1
15 3 1993 0 2
id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3)
year <- c(1989:1994, 1989:1994, 1991:1993)
onset <- c(0,0,1,0,0,0,0,1,0,1,0,0,0,0,0)
time_since_event <- c(0,1,2,0,1,2,0,1,0,1,2,3,0,1,2) # what I want to create
df <- data.frame(id, year, onset, time_since_event)
Try this:
id <- c(1,1,1,1,1,2,2,2,2,3,3)
year <- c(1989,1990,1991,1992,1993,1989,1990,1991,1992,1991,1992)
onset <- c(0,0,1,0,0,0,1,0,1,0,0)
# lagged cumulative sum of onset: a new period starts the year after each event
period <- c(0, cumsum(onset)[-length(onset)])
# within each (id, period) group, count years elapsed since the group's first year
time_since_event <- ave(year, id, period, FUN = function(x) x - x[1])
df <- data.frame(id, year, onset, time_since_event)
I created a variable called period which marks the successive stretches between events. It doesn't matter that the period numbering runs across patients, since we group by both patient and period, so the count starts over for a new patient or a new period.
Using the ave() function allows us to assign values within each grouping. Here we are transforming year based on the grouping variables id and period; the function passed at the end just subtracts the first value from the current value within each grouping.
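The same idea translates to dplyr; a hedged sketch, assuming rows are sorted by year within id:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(period = lag(cumsum(onset), default = 0)) %>%  # new period after each event
  group_by(id, period) %>%
  mutate(time_since_event = year - first(year)) %>%
  ungroup()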
I have a dataframe with a KEY/ID column, a year column, and two variables, V1 and V2.
KEY V1 V2 YEAR
1 10 5 1990
1 20 10 1991
1 30 15 1992
2 40 20 1990
2 50 25 1991
2 60 30 1992
I would like to compute the percent change in the values of V1 from one year to the next. That is, I would like to compute (V1[i+1]-V1[i])/V1[i], but only when the value of KEY[i+1] equals the value of KEY[i]. When they differ, I would like to get an NA.
KEY V1 V2 YEAR CHANGE
1 10 5 1990 1
1 20 10 1991 1
1 30 15 1992 NA
2 40 20 1990 0.25
2 50 25 1991 0.2
2 60 30 1992 NA
This is my attempt, using the Delt function from the quantmod package and ddply from plyr:
data$change <- ddply(data, "data$KEY", transform, DeltaCol=Delt(data$V1) )
Unfortunately, it doesn't do the trick.
Any help would be appreciated.
I don't know how to do it with ddply but it's pretty easy with ave:
> dat$pctchg <- ave(dat$V1, dat$KEY, FUN=function(x) c( NA, diff(x)/x[-length(x)]) )
> dat
KEY V1 V2 YEAR pctchg
1 1 10 5 1990 NA
2 1 20 10 1991 1.00
3 1 30 15 1992 0.50
4 2 40 20 1990 NA
5 2 50 25 1991 0.25
6 2 60 30 1992 0.20
ave works when you want a result that depends only on one vector within any number of categories. As far as I know, you cannot do multiple-vector calculations with ave, nor do you have access to the factor levels within the function. If you want the same calculation(s) applied to each of a group of vectors considered separately, then aggregate is best. Finally, if you want calculations that each depend on multiple vectors, use either do.call(rbind, by(dat, cats, function)) or lapply(split(dat, cats), function).
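For completeness, a hedged sketch of the ddply route the question attempted: the grouping variable should be given as the column name ("KEY", not "data$KEY"), and Delt should operate on the bare column V1 so that it sees each group's values rather than the whole data$V1 vector.
library(plyr)
library(quantmod)
# Delt() returns the arithmetic one-period change, with NA in the first slot
dat <- ddply(dat, "KEY", transform, pctchg = as.numeric(Delt(V1)))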
I am trying to figure out a way to loop through my data frame and replace any value greater than 200 with its decimal equivalent.
Here is my code:
for (i in data$AGE) if (i > 199) i <- i*.01-2
Here is a head() sample of my data frame:
AGE LOC RACE SEX WORKREL PROD1 ICD10 INJ_ST DTH_YEAR DTH_MONTH DTH_DAY ACC_YEAR ACC_MONTH ACC_DAY
1 26 5 1 1 0 1290 V865 UT 2003 1 1 2002 12 31
2 20 1 7 2 0 1899 X47 HI 2003 1 1 2003 1 1
3 202 1 2 2 0 1598 W75 FL 2003 1 1 2003 1 1
4 86 5 1 2 0 1807 W18 FL 2003 1 1 2002 12 14
5 203 1 2 1 0 1598 W75 GA 2003 1 1 2003 1 1
6 79 0 1 2 2 921 X49 MA 2003 1 1 NA NA NA
So basically, if the value of AGE is greater than 200, I want to multiply that value by .01 and then subtract 2.
The reason is that any value of 200 or greater encodes the age in months.
I'm not a Stats or R genius, so my humble thanks in advance for all advice.
data$AGE[data$AGE > 200] <- data$AGE[data$AGE > 200] * 0.01 - 2
You can do this reasonably elegantly with within and replace:
data <- within(data, AGE <- replace(AGE, AGE > 200, AGE[AGE > 200] * 0.01 - 2))
Or using data.table, for memory efficiency and syntactic elegance:
library(data.table)
DT <- as.data.table(data)
# make sure that AGE is numeric, not integer, before the fractional update
DT[, AGE := as.numeric(AGE)]
DT[AGE > 200, AGE := AGE * 0.01 - 2]
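As a quick sanity check, here is the same transform applied with a vectorized ifelse to the AGE values from the head() sample above:
x <- c(26, 20, 202, 86, 203, 79)
ifelse(x > 200, x * 0.01 - 2, x)
# [1] 26.00 20.00  0.02 86.00  0.03 79.00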