dcast reporting 1 or 0 rather than actual values - r

I have a data frame of this form
familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8
I want to transform it in the following way
familyid Year value_1 value_2
1 2000 5 6
2 2000 5 NA
3 2000 7 8
1 2002 5 5
2 2002 6 NA
3 2002 7 8
In other words I want to group my obs by familyid and year and then, for each memberid, create a column reporting the corresponding value of the last column. Whenever that family has only one member, I want to have NA in the value_2 column associated with member 2 of the reference family.
To do this I usually use the following code, which has worked for me before:
setDT(df)
dfnew<-data.table::dcast(df, Year + familyid ~ memberid, value.var=c("value"))
Unfortunately this time I get something like this
familyid Year value_1 value_2
1 2000 1 1
2 2000 1 0
3 2000 1 1
1 2002 1 1
2 2002 1 0
3 2002 1 1
In other words, I get a new data frame with 1 whenever the member exists (column value_1 contains only 1s, since every family has at least one member) and 0 whenever the member does not exist, regardless of the actual value in column "value". Does anybody know why this happens? Thank you for your time.

With tidyverse:
library(tidyverse)
df<-read.table(text="familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8",header=T)
df%>%
group_by(familyid,Year)%>%
spread(memberid,value)%>%
arrange(Year)%>%
mutate_at(c("1", "2"),.funs = funs( ifelse(is.na(.),0,1)))
# A tibble: 6 x 4
# Groups: familyid, Year [6]
familyid Year `1` `2`
<int> <int> <dbl> <dbl>
1 1 2000 1. 1.
2 2 2000 1. 0.
3 3 2000 1. 1.
4 1 2002 1. 1.
5 2 2002 1. 0.
6 3 2002 1. 1.
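The answer above reproduces the 0/1 table but does not say why dcast returned it. One likely cause, assuming the real data contain more than one row for some Year/familyid/memberid combination (the sample shown in the question does not): when the cells are not uniquely identified and no fun.aggregate is given, data.table::dcast falls back to length and reports counts instead of values. A minimal sketch, where the duplicated row is hypothetical:

```r
library(data.table)

# Year 2000 rows from the question
df <- data.table(
  familyid = c(1, 1, 2, 3, 3),
  Year     = 2000,
  memberid = c(1, 2, 1, 1, 2),
  value    = c(5, 6, 5, 7, 8)
)

# Every (Year, familyid, memberid) cell is unique, so the values are kept
wide <- dcast(df, Year + familyid ~ memberid, value.var = "value")

# Hypothetical: one row appears twice, so cells are no longer unique.
# dcast emits a message and defaults to fun.aggregate = length.
dup    <- rbind(df, df[1])
counts <- dcast(dup, Year + familyid ~ memberid, value.var = "value")

# One possible fix, assuming the duplicates are exact copies: drop them first
fixed <- dcast(unique(dup), Year + familyid ~ memberid, value.var = "value")
```

If the duplicated rows carry different values, you have to decide how to combine them and pass that choice as fun.aggregate explicitly.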

Related

How to keep only first value from distinct values in one column based on repeated values in other column in R? [duplicate]

The code below should group the data by group and then create two new columns with the first and last value of each group.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
dplyr::mutate() did the trick
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
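A plausible reason the explicit dplyr:: prefix helps, assuming another package that also exports mutate() (plyr is a common one) was attached after dplyr: the bare name then resolves to a function that ignores dplyr's grouping. A small sketch for checking which package the bare name points to; the values are copied from the expected output in the question:

```r
library(dplyr)

d <- data.frame(
  group = rep(1:3, each = 3),
  value = c(3, 8, 4, 8, 9, 1, 5, 9, 5)
)

# Which package does the bare name `mutate` currently resolve to?
# With only dplyr attached this is "dplyr".
mutate_source <- environmentName(environment(mutate))

# The namespace-qualified call is unaffected by masking
res <- d %>%
  group_by(group) %>%
  dplyr::mutate(
    first = dplyr::first(value),
    last  = dplyr::last(value)
  )
```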
You can also use the summarise function from dplyr to get the first and last values of each group, then join them back onto the original data:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions or want a future-proof solution, you can just index the columns like you would a list:
> d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5

Assigning values from one data set to another dataset based on year

I am curious if anyone knows how to apply the Value.1s for each year of df1 to their corresponding years in df2. This will hopefully create two columns for "Value.1" and "Value.2" inside df2. Obviously, the real dataset is quite large and I would rather not do this the long way. I imagine the code will start with df2 %>% mutate(...), ifelse? case_when? I really appreciate any help!
df1 <- data.frame(Year = c(2000:2002),
Value.1 = c(0:2))
Year Value.1
1 2000 0
2 2001 1
3 2002 2
df2 <- data.frame(Year = c(2000,2000,2000,2001,2001,2001,2002,2002,2002),
Value.2 = c(1:9))
Year Value.2
1 2000 1
2 2000 2
3 2000 3
4 2001 4
5 2001 5
6 2001 6
7 2002 7
8 2002 8
9 2002 9
You can do:
df2 <- df2 %>%
mutate(
Value.1 = recode(Year,
!!!setNames(unique(df1$Value.1),
unique(df1$Year))))
> df2
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
"Joins" are your friend when uniting dataframes. They're easy to understand. Try this:
df3 <- dplyr::left_join(df2, df1, by = "Year")
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
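For comparison, the same lookup can be sketched in base R with match(), which returns, for each Year in df2, the row position of that Year in df1. This is an alternative to the two answers above, not a replacement, and it assumes each Year appears only once in df1:

```r
df1 <- data.frame(Year = 2000:2002, Value.1 = 0:2)
df2 <- data.frame(Year = rep(2000:2002, each = 3), Value.2 = 1:9)

# Position of each df2 year inside df1, then pull Value.1 from that row
idx <- match(df2$Year, df1$Year)
df2$Value.1 <- df1$Value.1[idx]
```

Years in df2 that are missing from df1 would get NA, which is the same behaviour as left_join.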

R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?

I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
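The filter(cumsum(!is.na(key)) > 0) step is what deletes the earlier rows. A minimal sketch on customer 1's key column shows the idea: the running count stays at 0 until the first non-missing key, so only rows from that point on are kept.

```r
# key column for customer 1 in the example data
key <- c(NA, NA, "new", NA)

# Running count of non-missing keys seen so far within the group
seen <- cumsum(!is.na(key))

# Rows before the first non-missing key have seen == 0 and are dropped
keep <- seen > 0
```

For customer 3 the first row is marked "existing" beforehand, so the count is already 1 on row one and every row survives.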

Summing and discarding grouped variables

I have a dataframe of this form
familyid memberid year contract months
1 1 2000 1 12
1 1 2001 1 12
1 1 2002 1 12
1 1 2003 1 12
2 3 2000 2 12
2 3 2001 2 12
2 3 2002 2 12
2 3 2003 2 12
3 2 2000 1 5
3 2 2000 2 5
3 2 2001 1 12
3 2 2002 1 12
3 2 2003 1 12
4 1 2000 2 12
4 1 2001 2 12
4 1 2002 2 12
4 1 2003 2 12
5 2 2000 1 8
5 2 2001 1 12
5 2 2002 1 12
5 2 2003 1 4
5 2 2003 1 6
I want back a dataframe like
familyid memberid year contract months
1 1 2000 1 12
1 1 2001 1 12
1 1 2002 1 12
1 1 2003 1 12
2 3 2000 2 12
2 3 2001 2 12
2 3 2002 2 12
2 3 2003 2 12
4 1 2000 2 12
4 1 2001 2 12
4 1 2002 2 12
4 1 2003 2 12
5 2 2000 1 8
5 2 2001 1 12
5 2 2002 1 12
**5 2 2003 1 10**
Basically I want to sum the variable months when the same familyid shows the same value for the variable "contract" within a year (in my example I am summing 6 and 4 for familyid=5 in year=2003). However, I also want to discard familyids which show, during the same year, two different values for the variable contract (in my case I am discarding familyid=3 since it shows contract=1 and contract=2 in year=2000). For the other observations I want to keep things unchanged.
Does anybody know how to do this?
Thanks to anyone helping me.
Marco
You mentioned that you wanted to get the total months within one family's single contract in one year, but also to remove the families entirely with more than one contract in a year. Here's one approach:
library(dplyr)
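df2 <- df %>%
group_by(familyid, memberid, year, contract) %>%
summarize(months = sum(months, na.rm = TRUE)) %>%
# We need this to answer the second part. How many contracts did this family have this year?
# summarize() drops the last grouping level (contract), so n() counts rows per family, member, and year
mutate(contracts_this_yr = n()) %>%
ungroup() %>%
# Only include the families with no years of multiple contracts
group_by(familyid, memberid) %>%
filter(max(contracts_this_yr) < 2) %>%
ungroup() %>%
# Drop the helper column so the result has the five columns shown in the output
select(-contracts_this_yr)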
Output
df2
# A tibble: 16 x 5
familyid memberid year contract months
<int> <int> <int> <int> <int>
1 1 1 2000 1 12
2 1 1 2001 1 12
3 1 1 2002 1 12
4 1 1 2003 1 12
5 2 3 2000 2 12
6 2 3 2001 2 12
7 2 3 2002 2 12
8 2 3 2003 2 12
9 4 1 2000 2 12
10 4 1 2001 2 12
11 4 1 2002 2 12
12 4 1 2003 2 12
13 5 2 2000 1 8
14 5 2 2001 1 12
15 5 2 2002 1 12
16 5 2 2003 1 10
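To see both rules at work in isolation, here is a reduced sketch using only a few rows for families 3 and 5 from the question. The .groups = "drop_last" argument (available in dplyr 1.0 and later) just makes explicit the default that the approach relies on: after summarising, the data stay grouped by familyid, memberid, and year.

```r
library(dplyr)

# Subset of the question's data: family 3 has two contracts in 2000,
# family 5 has two rows with the same contract in 2003
df <- data.frame(
  familyid = c(3, 3, 3, 5, 5, 5),
  memberid = 2,
  year     = c(2000, 2000, 2001, 2002, 2003, 2003),
  contract = c(1, 2, 1, 1, 1, 1),
  months   = c(5, 5, 12, 12, 4, 6)
)

out <- df %>%
  group_by(familyid, memberid, year, contract) %>%
  summarise(months = sum(months), .groups = "drop_last") %>%
  # Still grouped by familyid, memberid, year: n() = contracts in that year
  mutate(contracts_this_yr = n()) %>%
  group_by(familyid, memberid) %>%
  # Drop the whole family if any year had more than one contract
  filter(max(contracts_this_yr) < 2) %>%
  ungroup()
```

Family 3 disappears entirely, and family 5 keeps one row per year with the 2003 months summed.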

merge data frames in R in two dimensions

DATA FRAME 1: HOUSE PRICE
year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
DATA FRAME 2: MORTGAGE INFO
ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
OUTCOME DESIRED:
ID MSA YEAR MONTH HOUSE_PRICE
1 MSA1 2000 2 1
2 MSA3 2001 3 7
3 MSA2 2001 3 5
Does anyone know how to achieve this efficiently? Data frame 2 is huge, while data frame 1 is a manageable size. Thanks!
Assuming both are data.tables dt1 and dt2, this can be done without having to cast them to long form as follows:
require(data.table)
dt2[dt1, .(ID, MSA, House_price = get(MSA)), by=.EACHI,
nomatch=0L, on=c(YEAR="year", MONTH="month")]
# YEAR MONTH ID MSA House_price
# 1: 2000 1 4 MSA1 12
# 2: 2000 2 1 MSA1 1
# 3: 2001 3 2 MSA3 7
# 4: 2001 3 3 MSA2 5
dt1 = fread('year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
')
dt2 = fread('ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
')
This looks like a case of turning a data frame from wide to long form and then merging two data frames. Here is a dplyr/tidyr solution with gather and right_join. The column names are changed to upper case only to make the join easier.
library(dplyr)
library(tidyr)
names(df1) <- toupper(names(df1))
gather(df1,MSA,HOUSE_PRICE,-YEAR,-MONTH) %>%
right_join(df2,by = c("YEAR","MONTH","MSA"))
output
YEAR MONTH MSA HOUSE_PRICE ID
1 2000 2 MSA1 1 1
2 2001 3 MSA3 7 2
3 2001 3 MSA2 5 3
4 2000 1 MSA1 12 4
5 2000 3 MSA3 NA 5
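A third option, sketched here in base R and not taken from either answer, avoids reshaping altogether: look up the matching row of the price table by year and month, the matching column by MSA name, and index the price matrix with those (row, column) pairs. The names row, col, and prices are illustrative.

```r
df1 <- data.frame(year = c(2000, 2000, 2001), month = c(1, 2, 3),
                  MSA1 = c(12, 1, 9), MSA2 = c(6, 3, 5), MSA3 = c(7, 4, 7))
df2 <- data.frame(ID = 1:5,
                  MSA = c("MSA1", "MSA3", "MSA2", "MSA1", "MSA3"),
                  YEAR = c(2000, 2001, 2001, 2000, 2000),
                  MONTH = c(2, 3, 3, 1, 3))

# Row of df1 whose year/month matches each mortgage (NA if there is none)
row <- match(paste(df2$YEAR, df2$MONTH), paste(df1$year, df1$month))
# Column of df1 holding the prices for each mortgage's MSA
col <- match(df2$MSA, names(df1))

# Index the numeric matrix with (row, col) pairs; an NA row yields NA
prices <- as.matrix(df1)
df2$HOUSE_PRICE <- prices[cbind(row, col)]
```

As with the right_join result, mortgages without a matching year and month (ID 5 here) get NA; filter them out if only matched rows are wanted.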

Resources