DATA FRAME 1: HOUSE PRICE
year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
DATA FRAME 2: MORTGAGE INFO
ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
OUTCOME DESIRED:
ID MSA YEAR MONTH HOUSE_PRICE
1 MSA1 2000 2 1
2 MSA3 2001 3 7
3 MSA2 2001 3 5
Does anyone know how to achieve this efficiently? Data frame 2 is huge and data frame 1 is a manageable size. Thanks!
Assuming both are data.tables dt1 and dt2, this can be done without having to cast them to long form as follows:
require(data.table)
dt2[dt1, .(ID, MSA, House_price = get(MSA)), by=.EACHI,
nomatch=0L, on=c(YEAR="year", MONTH="month")]
# YEAR MONTH ID MSA House_price
# 1: 2000 1 4 MSA1 12
# 2: 2000 2 1 MSA1 1
# 3: 2001 3 2 MSA3 7
# 4: 2001 3 3 MSA2 5
dt1 = fread('year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
')
dt2 = fread('ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
')
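If you would rather avoid get() inside the join, one alternative is to melt the small price table to long form and then add the price to the large table with an update join. This is only a sketch under a few assumptions: the tables are rebuilt inline so the block runs on its own, and prices_long and HOUSE_PRICE are names chosen for illustration. Because the update is done by reference, the huge table is not copied.

```r
library(data.table)

# Sample data from the question, rebuilt inline so the block is self-contained
dt1 <- data.table(year = c(2000, 2000, 2001), month = c(1, 2, 3),
                  MSA1 = c(12, 1, 9), MSA2 = c(6, 3, 5), MSA3 = c(7, 4, 7))
dt2 <- data.table(ID = 1:5,
                  MSA = c("MSA1", "MSA3", "MSA2", "MSA1", "MSA3"),
                  YEAR = c(2000, 2001, 2001, 2000, 2000),
                  MONTH = c(2, 3, 3, 1, 3))

# Reshape the small table to one row per (year, month, MSA)
prices_long <- melt(dt1, id.vars = c("year", "month"),
                    variable.name = "MSA", value.name = "HOUSE_PRICE",
                    variable.factor = FALSE)

# Update join: adds HOUSE_PRICE to dt2 by reference;
# rows of dt2 without a matching price (ID 5 here) stay NA
dt2[prices_long, on = .(YEAR = year, MONTH = month, MSA),
    HOUSE_PRICE := i.HOUSE_PRICE]
dt2
```

Unlike the nomatch=0L version, this keeps every row of dt2, so drop the NA rows afterwards if you only want matches.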
This looks like a case of reshaping a data frame from wide to long form and then merging the two data frames. Here is a tidyverse solution using tidyr's gather and dplyr's right_join. The column names are upper-cased only so that the join keys match in both data frames.
library(dplyr)
library(tidyr)
names(df1) <- toupper(names(df1))
gather(df1,MSA,HOUSE_PRICE,-YEAR,-MONTH) %>%
right_join(df2,by = c("YEAR","MONTH","MSA"))
output
YEAR MONTH MSA HOUSE_PRICE ID
1 2000 2 MSA1 1 1
2 2001 3 MSA3 7 2
3 2001 3 MSA2 5 3
4 2000 1 MSA1 12 4
5 2000 3 MSA3 NA 5
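In tidyr 1.0.0 and later, gather is superseded by pivot_longer, so the same idea can be written as below. This is a sketch with the sample data rebuilt inline (already using upper-case key names); prices_long and result are illustrative names. Using left_join with df2 first keeps the column order asked for in the question.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with the join keys already in upper case
df1 <- data.frame(YEAR = c(2000, 2000, 2001), MONTH = c(1, 2, 3),
                  MSA1 = c(12, 1, 9), MSA2 = c(6, 3, 5), MSA3 = c(7, 4, 7))
df2 <- data.frame(ID = 1:5,
                  MSA = c("MSA1", "MSA3", "MSA2", "MSA1", "MSA3"),
                  YEAR = c(2000, 2001, 2001, 2000, 2000),
                  MONTH = c(2, 3, 3, 1, 3))

# Wide to long: one row per (YEAR, MONTH, MSA)
prices_long <- pivot_longer(df1, cols = -c(YEAR, MONTH),
                            names_to = "MSA", values_to = "HOUSE_PRICE")

# Keep every mortgage row; unmatched rows get NA for HOUSE_PRICE
result <- left_join(df2, prices_long, by = c("YEAR", "MONTH", "MSA"))
result
```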
I am curious if anyone knows how to apply the Value.1s for each year of df1 to their corresponding years in df2. This will hopefully create two columns for "Value.1" and "Value.2" inside df2. Obviously, the real dataset is quite large and I would rather not do this the long way. I imagine the code will start with df2 %>% mutate(...), ifelse? case_when? I really appreciate any help!
df1 <- data.frame(Year = c(2000:2002),
Value.1 = c(0:2))
Year Value.1
1 2000 0
2 2001 1
3 2002 2
df2 <- data.frame(Year = c(2000,2000,2000,2001,2001,2001,2002,2002,2002),
Value.2 = c(1:9))
Year Value.2
1 2000 1
2 2000 2
3 2000 3
4 2001 4
5 2001 5
6 2001 6
7 2002 7
8 2002 8
9 2002 9
You can do:
df2 <- df2 %>%
mutate(
Value.1 = recode(Year,
!!!setNames(unique(df1$Value.1),
unique(df1$Year))))
> df2
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
"Joins" are your friend when uniting dataframes. They're easy to understand. Try this:
df3 <- dplyr::left_join(df2, df1, by = "Year")
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
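For a single lookup column like this, a package-free alternative is base R's match(), which returns the position of each df2 year in df1. This is just a sketch on the sample data and assumes each year appears only once in df1.

```r
df1 <- data.frame(Year = 2000:2002, Value.1 = 0:2)
df2 <- data.frame(Year = rep(2000:2002, each = 3), Value.2 = 1:9)

# match() gives, for each df2 row, the row of df1 with the same Year;
# years missing from df1 would produce NA
df2$Value.1 <- df1$Value.1[match(df2$Year, df1$Year)]
df2
```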
I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
As a first step, mark existing customers (customer 3 in this case) in the key column. Within each customer, the filter then keeps every row from the first non-NA key onward -
library(dplyr)
df %>%
  group_by(customer) %>%
  mutate(
    key = as.character(key), # can be avoided if key is a character to begin with
    key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
  ) %>%
  filter(cumsum(!is.na(key)) > 0) %>%
  ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
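If the "existing" label is not needed, a shorter variant is to ignore the key column and drop rows until the first recorded purchase. This is a different technique from the one above, sketched under the assumption that rows are already sorted by year within each customer and that "no purchase yet" always shows up as NA in both C and D.

```r
library(dplyr)

df <- data.frame(customer = rep(1:3, each = 4),
                 year = rep(2000:2003, times = 3),
                 C = c(NA, NA, 1, 2, NA, NA, NA, 2, 2, 6, 2, 1),
                 D = c(NA, NA, 5, 6, NA, NA, NA, 7, 4, 4, 5, 8),
                 key = c(NA, NA, "new", NA, NA, NA, NA, "new", NA, NA, NA, NA))

trimmed <- df %>%
  group_by(customer) %>%
  # the running count stays 0 until the first row with any purchase
  filter(cumsum(!is.na(C) | !is.na(D)) > 0) %>%
  ungroup()
trimmed
```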
I have a data frame of this form
familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8
I want to transform it in the following way
familyid Year value_1 value_2
1 2000 5 6
2 2000 5 NA
3 2000 7 8
1 2002 5 5
2 2002 6 NA
3 2002 7 8
In other words I want to group my obs by familyid and year and then, for each memberid, create a column reporting the corresponding value of the last column. Whenever that family has only one member, I want to have NA in the value_2 column associated with member 2 of the reference family.
To do this, I usually (and successfully) use the following code:
setDT(df)
dfnew<-data.table::dcast(df, Year + familyid ~ memberid, value.var=c("value"))
Unfortunately this time I get something like this
familyid Year value_1 value_2
1 2000 1 1
2 2000 1 0
3 2000 1 1
1 2002 1 1
2 2002 1 0
3 2002 1 1
In other words, I get a new data frame with 1 whenever the member exists (column value_1 is all 1s, since every family has at least one member) and 0 whenever the member does not exist, regardless of the actual value in the "value" column. Does anybody know why this happens? Thank you for your time.
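With tidyverse, spread produces the wide layout. Note that the final mutate_at step below recodes the values as 0/1 presence indicators; leave it out to keep the actual values: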
library(tidyverse)
df<-read.table(text="familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8",header=T)
df %>%
  group_by(familyid, Year) %>%
  spread(memberid, value) %>%
  arrange(Year) %>%
  # formula syntax replaces funs(), which is deprecated since dplyr 0.8.0
  mutate_at(c("1", "2"), ~ ifelse(is.na(.), 0, 1))
# A tibble: 6 x 4
# Groups: familyid, Year [6]
familyid Year `1` `2`
<int> <int> <dbl> <dbl>
1 1 2000 1. 1.
2 2 2000 1. 0.
3 3 2000 1. 1.
4 1 2002 1. 1.
5 2 2002 1. 0.
6 3 2002 1. 1.
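As for why dcast returned 0/1 in the question: a likely cause (an assumption, since the real data is not shown) is that some familyid/Year/memberid combination occurs more than once. When a cell is not uniquely identified, data.table's dcast falls back to fun.aggregate = length and reports that the aggregate function is missing, so you get counts instead of values. The sketch below uses hypothetical data with one duplicated row to show how to check for this and one way around it.

```r
library(data.table)

# Hypothetical data: the (familyid 1, Year 2000, memberid 1) row appears twice
df <- data.table(familyid = c(1, 1, 1, 2),
                 Year     = c(2000, 2000, 2000, 2000),
                 memberid = c(1, 1, 2, 1),
                 value    = c(5, 5, 6, 5))

# Count rows per cell; any N above 1 forces dcast to aggregate
dups <- df[, .N, by = .(familyid, Year, memberid)][N > 1]
dups

# One option: drop exact duplicate rows before casting
wide <- dcast(unique(df), Year + familyid ~ memberid, value.var = "value")
wide
```

If the duplicates carry different values, you would instead need to decide on an explicit fun.aggregate (for example mean) rather than dropping rows.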
I need to remove years that do not have measurements for every day of the year. Pretend this is a full set and I want to get rid of all 2001 rows because 2001 has one missing measurement.
year day value
2000 1 5
2000 2 3
2000 3 2
2000 4 3
2001 1 2
2001 2 NA
2001 3 6
2001 4 5
Sorry I don't have code attempts, I can't wrap my head around it right now and it took me forever to get this far. Prefer something I can %>% in, as it's at the end of a long run.
Filtering based on presence of NA values:
df %>%
group_by(year) %>%
filter(!anyNA(value))
Alternative filter conditions (pick what you find most readable):
all(!is.na(value))
sum(is.na(value)) == 0
!any(is.na(value))
Here's a one-line solution using base R's ave (the pipe itself still needs magrittr or dplyr) -
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
Example -
df <- data.frame(year = c(rep(2000, 4), rep(2001, 4)), day = 1:4, value = sample.int(10, 8))
df$value[6] <- NA_integer_
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
# 5 2001 1 8
# 6 2001 2 NA
# 7 2001 3 1
# 8 2001 4 5
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
In base R you could do:
subset(df,!year %in% year[is.na(value)])
# year day value
# 1 2000 1 8
# 2 2000 2 5
# 3 2000 3 4
# 4 2000 4 1
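All of the answers above treat a missing measurement as an NA in value. If a missing day could also mean that the row is absent altogether, a count check can be added to the same filter. This is a sketch on made-up data; days_expected is a hypothetical parameter (4 here to match the toy example, 365 or 366 for real daily data).

```r
library(dplyr)

# Toy data: 2001 has an NA, 2002 is missing the row for day 4 entirely
df <- data.frame(year = c(rep(2000, 4), rep(2001, 4), rep(2002, 3)),
                 day = c(1:4, 1:4, 1:3),
                 value = c(5, 3, 2, 3, 2, NA, 6, 5, 1, 2, 3))

days_expected <- 4  # hypothetical; set to the real number of days per year

complete_years <- df %>%
  group_by(year) %>%
  # keep a year only if no value is NA and every expected day is present
  filter(!anyNA(value), n_distinct(day) == days_expected) %>%
  ungroup()
complete_years
```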
I have several data frames that are all in same format, like:
price <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
size <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
performance <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
> price
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> size
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> performance
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
and I want to merge these data frames into a different layout. The desired output looks like:
> df
name Year price size performance
1 A 2001 1 1 1
2 A 2002 2 2 2
3 A 2003 3 3 3
4 B 2001 2 2 2
5 B 2002 3 3 3
6 B 2003 4 4 4
7 C 2001 4 4 4
8 C 2002 5 5 5
9 C 2003 6 6 6
which arranges the data by name and then by date. Since I have over 2000 names and 180 dates in each of the 20 data frames, it is impractical to do this by typing in each specific name.
You need to convert your data frames to long format and then join them together:
library(tidyverse)
price_long <- price %>% gather(key, value = "price", -Year)
size_long <- size %>% gather(key, value = "size", -Year)
performance_long <- performance %>% gather(key, value = "performance", -Year)
price_long %>%
left_join(size_long) %>%
left_join(performance_long)
Joining, by = c("Year", "key")
Joining, by = c("Year", "key")
Year key price size performance
1 2001 A 1 1 1
2 2002 A 2 2 2
3 2003 A 3 3 3
4 2001 B 2 2 2
5 2002 B 3 3 3
6 2003 B 4 4 4
7 2001 C 4 4 4
8 2002 C 5 5 5
9 2003 C 6 6 6
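You can use data.table:
library(data.table)
a <- list(price = price, size = size, performance = performance)
# stack the tables with a "name" id column, melt to long, then cast one column per table
dcast(melt(rbindlist(a, use.names = TRUE, idcol = "name"), id.vars = 1:2), variable + Year ~ name)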
variable Year performance price size
1: A 2001 1 1 1
2: A 2002 2 2 2
3: A 2003 3 3 3
4: B 2001 2 2 2
5: B 2002 3 3 3
6: B 2003 4 4 4
7: C 2001 4 4 4
8: C 2002 5 5 5
9: C 2003 6 6 6
We can combine the data frames, gather and spread the combined data frame.
library(tidyverse)
dat <- list(price, size, performance) %>%
setNames(c("price", "size", "performance")) %>%
bind_rows(.id = "type") %>%
gather(name, value, A:C) %>%
spread(type, value) %>%
arrange(name, Year)
dat
# Year name performance price size
# 1 2001 A 1 1 1
# 2 2002 A 2 2 2
# 3 2003 A 3 3 3
# 4 2001 B 2 2 2
# 5 2002 B 3 3 3
# 6 2003 B 4 4 4
# 7 2001 C 4 4 4
# 8 2002 C 5 5 5
# 9 2003 C 6 6 6
dplyr::bind_rows comes in quite handy in such scenarios. One possible solution:
library(tidyverse)
bind_rows(list(price = price, size = size, performance = performance), .id="Type") %>%
gather(Key, Value, - Type, -Year) %>%
spread(Type, Value)
# Year Key performance price size
# 1 2001 A 1 1 1
# 2 2001 B 2 2 2
# 3 2001 C 4 4 4
# 4 2002 A 2 2 2
# 5 2002 B 3 3 3
# 6 2002 C 5 5 5
# 7 2003 A 3 3 3
# 8 2003 B 4 4 4
# 9 2003 C 6 6 6
The above solution is very similar to the one by @www; it just avoids the use of setNames.
To round it out, here's a package-free base R answer.
# gather the data.frames into a list
myList <- mget(ls())
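Note that this assumes the three data.frames are the only objects in my environment; otherwise, build the named list explicitly with list(price = price, size = size, performance = performance).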
# get the final data.frame
Reduce(merge,
Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
myList, names(myList)))
This returns
Year ID performance price size
1 2001 A 1 1 1
2 2001 B 2 2 2
3 2001 C 4 4 4
4 2002 A 2 2 2
5 2002 B 3 3 3
6 2002 C 5 5 5
7 2003 A 3 3 3
8 2003 B 4 4 4
9 2003 C 6 6 6
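For completeness, the gather/spread answers above can also be written with pivot_longer and pivot_wider, which supersede them in tidyr 1.0.0 and later. This is a sketch on the sample data; tables, measure and combined are illustrative names. With 20 data frames, only the named list needs to grow.

```r
library(dplyr)
library(tidyr)

price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price
performance <- price

# The list names become the output column names
tables <- list(price = price, size = size, performance = performance)

combined <- bind_rows(tables, .id = "measure") %>%
  # long form: one row per (measure, Year, name)
  pivot_longer(cols = -c(measure, Year), names_to = "name", values_to = "value") %>%
  # one column per source data frame
  pivot_wider(names_from = measure, values_from = value) %>%
  select(name, Year, everything()) %>%
  arrange(name, Year)
combined
```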