Conditional subsetting gone wrong in R

So I have this fairly basic problem with R subsetting, but because I'm a newbie I don't know how to solve it properly. Here's an example of the panel data I have:
idnr year sales space municipality pop
1 1 2004 110000 1095 136 71377
2 1 2005 110000 1095 136 71355
3 1 2006 110000 1095 136 71837
4 1 2007 120000 1095 136 72956
5 2 2004 35000 800 136 71377
6 3 2004 45000 1000 136 71377
7 3 2005 45000 1000 2584 23135
8 3 2006 45000 1000 2584 23258
9 3 2007 45000 1000 2584 23407
10 4 2005 180000 5000 2584 23254
11 4 2006 220000 5000 2584 23135
12 4 2007 250000 5000 2584 23258
So my problem is that I want to subset the data using conditions on both year = 2004 AND (not OR) year = 2005. However, it doesn't seem to work. Code:
tab3 <- stores[stores$year==2004 & stores$year==2005, c("idnr","year")]
What I am trying to say is that I need to select the data which existed in both 2004 and 2005, because some entries existed in either 2004 or 2005, but not in both, and hence should be excluded. Using the data above as an example, this should be the output:
idnr year
1 2004
1 2005
3 2004
3 2005
Update:
I was hoping that akrun's method might work for selecting data entries which appeared ONLY in 2005, such that:
idnr year
4 2005
Unfortunately, it doesn't. Instead, it groups the idnr's which appeared in both 2004 and 2005 together with those which appeared only in 2005. Any ideas?

Here is an option using data.table. Convert the dataset ("df") to a data.table using setDT, set the "year" column as the key (setkey(..)), subset the rows that have 2004/2005 in the "year" column (J(c(2004,..))), and select the first two columns (1:2).
library(data.table) # data.table_1.9.5
DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE]
DT1
# idnr year
#1: 1 2004
#2: 2 2004
#3: 3 2004
#4: 1 2005
#5: 3 2005
#6: 4 2005
Update
Based on the updated expected result, we can check whether there is more than one unique "year" entry (uniqueN(year)>1) per "idnr" group, get the row index (.I) as a column ("V1"), and use it to subset the data.table "DT1".
DT1[DT1[, .I[uniqueN(year)>1], idnr]$V1,]
# idnr year
#1: 1 2004
#2: 1 2005
#3: 3 2004
#4: 3 2005
Or everything as a one-liner:
setDT(df)[year %in% 2004:2005, if(uniqueN(year) > 1L) year, idnr]
# idnr V1
# 1: 1 2004
# 2: 1 2005
# 3: 3 2004
# 4: 3 2005
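If the grouped column should come out named year instead of the default V1 shown above, one small variation (a hedged tweak, not from the original answer) is to return a named list from j:
setDT(df)[year %in% 2004:2005, if (uniqueN(year) > 1L) .(year = year), idnr]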
Or a base R option would be
indx <- with(df, ave(year==2004, idnr, FUN=any) &
                 ave(year==2005, idnr, FUN=any) &
                 year %in% 2004:2005)
df[indx,1:2]
# idnr year
#1 1 2004
#2 1 2005
#6 3 2004
#7 3 2005
Update2
Based on the dataset and the expected result shown, we can check whether the first value of "year" is 2005 for each "idnr" group. If it is TRUE, then subset the first observation (.SD[1L,..]) and select the columns that are needed.
setDT(df)[,if(year[1L]==2005) .SD[1L,1,with=FALSE], by = idnr]
# idnr year
#1: 4 2005
Or
setDT(df)[df[,.I[year[1L]==2005] , by = idnr]$V1[1L], 1:2, with=FALSE]
# idnr year
#1: 4 2005

If you want to subset rows with either year == 2004 or year == 2005, you need to use the | operator instead of & in your current approach:
tab3 <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]
Which results in:
#> tab3
# idnr year
#1 1 2004
#2 1 2005
#5 2 2004
#6 3 2004
#7 3 2005
#10 4 2005
Or using dplyr:
library(dplyr)
tab3 <- stores %>% select(idnr, year) %>% filter(year == 2004 | year == 2005)
More concisely:
tab3 <- stores %>% select(idnr, year) %>% filter(year %in% c(2004, 2005))
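Note that this keeps rows from either year. If you need only the idnr values present in both 2004 and 2005, as the question describes, here is a minimal dplyr sketch building on the same idea (assuming stores is the data frame from the question):
tab3 <- stores %>%
  filter(year %in% c(2004, 2005)) %>%        # keep only the two years of interest
  group_by(idnr) %>%
  filter(all(c(2004, 2005) %in% year)) %>%   # keep groups that have both years
  ungroup() %>%
  select(idnr, year)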

Related

Dropping variables grouped if statement and creating time to event variable in R

I am trying to work with a big dataset and started using R for that purpose. I am trying to create a variable named time to diagnosis (time_to_dx), which is a time-to-event variable (yearHb - Dx) computed for each patient ID. I would also like to drop all the measurements taken prior to the diagnosis, but I am guessing that once I am able to create the time_to_dx variable, that should be straightforward.
I am attaching an example of the dataset and the expected outcome.
Many thanks for your help.
ID  Dx    Hb    yearHb  time_to_dx
1   2001  16.5  1997
1   2001  21.3  2002    1
1   2001  19.5  2005    4
2   2005  14.5  2002
2   2005  15.6  2004
2   2005  21    2006    1
2   2005  22    2007    2
2   2005  17.9  2003
3   2006  18.1  2003
3   2006  19.7  2006    0
3   2006  19.1  2008    2
3   2006  17.3  2007    1
Assuming that dt is the name of your data.frame
Using dplyr:
library(dplyr)
result <- dt %>%
  mutate(
    time_to_dx = yearHb - Dx,
    time_to_dx = ifelse(time_to_dx < 0, NA, time_to_dx)
  )
Using base R:
dt$time_to_dx <- dt$yearHb - dt$Dx
dt$time_to_dx <- ifelse(dt$time_to_dx < 0, NA, dt$time_to_dx)
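The question also asks to drop the measurements taken before the diagnosis. Once time_to_dx is computed as above, one possible sketch (assuming the pre-diagnosis rows are exactly those set to NA) is to keep only the non-NA rows:
# dplyr version, continuing from `result`
result <- result %>% filter(!is.na(time_to_dx))
# base R version, continuing from `dt`
dt <- dt[!is.na(dt$time_to_dx), ]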

replacing NA with next available number within a group

I have a relatively large dataset, and I want to replace an NA value for the price in a specific year and for a specific ID number with the value available in the next year within the group for the same ID number. Here is a reproducible example:
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
ID year value
1 1 2000 1000
2 2 2001 20000
3 3 2002 30000
4 2 2002 NA
5 2 2003 40000
6 3 2007 NA
7 1 2001 6000
8 4 2000 4000
9 5 2005 NA
10 5 2006 20000
11 1 2002 7000
12 2 2004 50000
13 2 2005 60000
So, for example, for ID=2 we have the following values and years:
ID year value
2 2001 20000
2 2002 NA
2 2003 40000
2 2004 50000
2 2005 60000
So in the above case, NA should be replaced with 40000 (the value in the next year), and the same applies to the other IDs.
The final result should be in this form:
ID year value
1 2000 1000
1 2001 6000
1 2002 7000
2 2001 20000
2 2002 40000
2 2003 40000
2 2004 50000
2 2005 60000
3 2007 NA
4 2000 4000
5 2005 20000
5 2006 20000
Please note that for ID=3, since there is no next year available, we want to leave it as is; that's why it stays NA.
I'd appreciate it if you could suggest a solution.
Thanks
dplyr solution
library(tidyverse)
data2 <- data %>%
  dplyr::group_by(ID) %>%
  dplyr::arrange(year) %>%
  dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups: ID [5]
ID year value replaced_value
<dbl> <dbl> <dbl> <dbl>
1 1 2000 1000 1000
2 4 2000 4000 4000
3 2 2001 20000 20000
4 1 2001 6000 6000
5 3 2002 30000 30000
6 2 2002 NA 40000
7 1 2002 7000 7000
8 2 2003 40000 40000
9 2 2004 50000 50000
10 5 2005 NA 20000
11 2 2005 60000 60000
12 5 2006 20000 20000
13 3 2007 NA NA
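If you also want the ID-then-year ordering shown in the question's expected result, a small optional addition is to sort at the end:
data2 <- data2 %>% arrange(ID, year)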
Try this tidyverse approach using a flag to check sequential years and fill() to complete data:
library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>%
  arrange(ID, year) %>%
  group_by(ID) %>%
  mutate(Flag = c(1, diff(year))) %>%
  fill(value, .direction = 'downup') %>%
  mutate(value = ifelse(Flag != 1, NA, value)) %>%
  select(-Flag)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 20000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
You could do:
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(value = coalesce(value,
                          as.integer(sapply(pmin(year + 1, max(year)),
                                            function(x) value[year == x])))) %>%
  arrange(ID, year)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:
library(tidyverse)
data %>%
  arrange(ID, year) %>%
  group_by(ID, idx = cumsum(is.na(value))) %>%
  fill(value, .direction = 'up') %>%
  ungroup %>%
  select(-idx)
This is much more straightforward (and likely much faster) in data.table:
library(data.table)
setorder(setDT(data), ID, year)   # reorder by reference so the fill persists in `data`
data[, value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]

Calculation and replacement in R

I have a dataset like the following, and I need to compare the value of each year (2005-2009) with the average value over 2002-2004.
Year Firm R
2002 A 30
2003 A 11
2004 A 1
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 24
2003 B 30
2004 B 25
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
The issue is that I need to calculate the average of 2002-2004 for each firm and replace the values in years 2002-2004 with the new value (i.e., the calculated average). For example, the new dataset should look like this:
Year Firm R
2002 A 14
2003 A 14
2004 A 14
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 26.333
2003 B 26.333
2004 B 26.333
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
I have tried to use the following code:
df$R[df$Year==2002 & df$Year==2003 & df$Year==2004] = (df$R[df$Year==2002] + df$R[df$Year==2003] + df$R[df$Year==2004])/3
but when I apply it, nothing changes!
I hope you can help with this issue.
You can use data.table for this if you like:
library(data.table)
year <- c(rep(seq(2002,2009,1),2))
firm <- c(rep("A",8),rep("B",8))
r <- c(30,11,1,7,15,20,3.5,8,24,30,25,5.2,11.8,78,90,57)
aa <- data.table(year,firm,r)
aa[year>=2002 & year<=2004, r:= mean(r), by = firm]
Giving this result:
year firm r
1: 2002 A 14.00000
2: 2003 A 14.00000
3: 2004 A 14.00000
4: 2005 A 7.00000
5: 2006 A 15.00000
6: 2007 A 20.00000
7: 2008 A 3.50000
8: 2009 A 8.00000
9: 2002 B 26.33333
10: 2003 B 26.33333
11: 2004 B 26.33333
12: 2005 B 5.20000
13: 2006 B 11.80000
14: 2007 B 78.00000
15: 2008 B 90.00000
16: 2009 B 57.00000
The mistake in your code is that you are not grouping by firm name, and you are using & instead of |. In my example, test.txt is a file containing the same input as in the question.
The code below should help you achieve what you need.
library(dplyr)
df <- read.delim('test.txt', header = T, sep = '\t')
print(df)
# get unique firm names for grouping
firms <- unique(df$Firm)
# for each firm, calculate mean and update it
for (f in firms){
  df$R[df$Firm == f & (df$Year==2002 | df$Year==2003 | df$Year==2004)] =
    sum(df$R[df$Firm == f & (df$Year==2002 | df$Year==2003 | df$Year==2004)])/3
}
print(df)
Try this dplyr version:
library(tidyverse)
data %>%
  filter(Year < 2005) %>%              # this subsets the data
  group_by(Firm) %>%                   # state which values you want to evaluate
  summarise(m = mean(R)) %>%           # take the mean (named m)
  left_join(data) %>%                  # join the original data to the summarised data
  mutate(R = ifelse(Year < 2005 & Firm == 'A', m,
                    ifelse(Year < 2005 & Firm == 'B', m, R))) %>%  # nested ifelse to define conditions
  select(Year, Firm, R) -> newdata     # select the desired columns and name the result
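As a hedged alternative (not part of the answer above), the same replacement can be written as a single grouped mutate, with no join and no hard-coded firm names, assuming the data frame is called data:
library(dplyr)
data %>%
  group_by(Firm) %>%
  mutate(R = ifelse(Year %in% 2002:2004, mean(R[Year %in% 2002:2004]), R)) %>%  # per-firm 2002-2004 mean
  ungroup()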

R issues with merge/rbind/concatenate two data frames

I am a beginner with R, so I apologise in advance if the question was asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records based on "customer_no", keeping only the customers that appear in df2. Both data.frames have multiple rows for each customer_no.
I tried the following:
merged.df <- merge(df1, df2, by="customer_no", all.y=TRUE)
The problem is that this assigns values of df1 to df2 where instead it should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess if I resolve the above question this should be easy to see by the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you still be kind enough to rephrase it in English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
10 UK 2001
10 UK 2002
10 UK 2003
20 US 2007
30 AU 2006
df2:
customer_no income
10 700
10 800
10 900
30 1000
Merged file should look like this:
merged.df:
customer_no income country year
10 UK 2001
10 UK 2002
10 UK 2003
10 700
10 800
10 900
30 AU 2006
30 1000
So:
It puts all the columns together, it adds the rows of df2 right after the last row of df1 for the same customer_no, and it keeps only the customer_no values that are in df2 (merged.df does not have customer_no 20). Also, it leaves all the other cells empty.
In Stata I would use append, but I'm not sure what to use in R... perhaps join?
THANKS!!
Try:
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (using the original datasets):
library(data.table)
indx <- df1$customer_no %in% df2$customer_no
rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000
You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
Try:
df1$income = df2$country = df2$year = NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000
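Note that this keeps customer_no 20 from df1, while the expected merged.df in the question only keeps customers that also appear in df2. A possible tweak to match that, assuming the NA columns have already been added as above:
rbind(df1[df1$customer_no %in% df2$customer_no, ], df2)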

Merge 2 data frame based on 2 columns with different column names

I have 2 very large data sets that look like the ones below:
merge_data <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                         position = c("yes","no","yes","no","yes",
                                      "no","yes","no","yes","yes"),
                         school = c("a","b","a","a","c","b","c","d","d","e"),
                         year1 = c(2000,2000,2000,2001,2001,2000,
                                   2003,2005,2008,2009),
                         year2 = year1 - 1)
merge_data
ID position school year1 year2
1 1 support a 2000 1999
2 2 oppose b 2000 1999
3 3 support a 2000 1999
4 4 oppose a 2001 2000
5 5 support c 2001 2000
6 6 oppose b 2000 1999
7 7 support c 2003 2002
8 8 oppose d 2005 2004
9 9 support d 2008 2007
10 10 support e 2009 2008
merge_data_2 <- data.frame(year = c(1999,1999,2000,2000,2000,2001,2003,
                                    2012,2009,2009,2008,2002,2009,2005,
                                    2001,2000,2002,2000,2008,2005),
                           amount = c(100,200,300,400,500,600,700,800,900,
                                      1000,1100,1200,1300,1400,1500,1600,
                                      1700,1800,1900,2000),
                           ID = c(1,1,2,2,2,3,3,3,5,6,8,9,10,13,15,17,19,20,21,7))
merge_data_2
year amount ID
1 1999 100 1
2 1999 200 1
3 2000 300 2
4 2000 400 2
5 2000 500 2
6 2001 600 3
7 2003 700 3
8 2012 800 3
9 2009 900 5
10 2009 1000 6
11 2008 1100 8
12 2002 1200 9
13 2009 1300 10
14 2005 1400 13
15 2001 1500 15
16 2000 1600 17
17 2002 1700 19
18 2000 1800 20
19 2008 1900 21
20 2005 2000 7
And what I want is:
ID position school year1 year2 amount
1 yes a 2000 1999 300
2 no b 2000 1999 1200
10 yes e 2009 2008 1300
For ID=1, we have amount = 300 in the expected output, since merge_data_2 has 2 rows where ID = 1 and their year equals the year1 or year2 of ID = 1 in merge_data.
So basically what I want is to perform a merge based on the ID and year.
2 conditions:
ID from merge_data matches the ID from merge_data_2
one of year1 and year2 from merge_data also matches the year from merge_data_2.
Then make the merge and take the sum of the amount for each ID.
I think the code will be something like:
merge_data_final <- merge(merge_data, merge_data_2,
merge_data$ID == merge_data_2$ID && (merge_data$year1 ||
merge_data$year2 == merge_data_2$year))
Then somehow aggregate the amount by ID.
Obviously I know the code is wrong, and I have been thinking about the plyr or reshape libraries, but was having difficulty getting my hands on them.
Any help would be great! Thanks, guys!
As noted above, I think you have some discrepancies between your example input and output data. Here's the basic approach - you were on the right track with reshape2. You can simply melt() your data into long format so you are joining on a single column instead of the either/or bit you had going on before.
library(reshape2)
#melt into long format
merge_data_m <- melt(merge_data, measure.vars = c("year1", "year2"))
#merge together, specifying the joining columns
merge(merge_data_m, merge_data_2, by.x = c("ID", "value"), by.y = c("ID", "year"))
#-----
ID value position school variable amount
1 1 1999 yes a year2 100
2 1 1999 yes a year2 200
3 2 2000 no b year1 500
4 2 2000 no b year1 300
5 2 2000 no b year1 400
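The question also asks for the amount to be summed per ID after the merge. One possible follow-up, where merged is a name introduced here for the merge result above:
merged <- merge(merge_data_m, merge_data_2,
                by.x = c("ID", "value"), by.y = c("ID", "year"))
# sum the matched amounts for each ID
aggregate(amount ~ ID, data = merged, FUN = sum)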
