Remove group that has NAs in only some rows - r

I need to remove years that do not have measurements for every day of the year. Pretend this is a full set and I want to get rid of all 2001 rows because 2001 has one missing measurement.
year day value
2000 1 5
2000 2 3
2000 3 2
2000 4 3
2001 1 2
2001 2 NA
2001 3 6
2001 4 5
Sorry I don't have code attempts, I can't wrap my head around it right now and it took me forever to get this far. Prefer something I can %>% in, as it's at the end of a long run.

Filtering based on presence of NA values:
df %>%
group_by(year) %>%
filter(!anyNA(value))
Alternative filter conditions (pick what you find most readable):
all(!is.na(value))
sum(is.na(value)) == 0
!any(is.na(value))
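For reference, here is a minimal self-contained sketch of the grouped filter on the question's sample data. The extra n() == days_expected condition is an assumption that goes beyond the answer above: it also drops years where a day's row is absent altogether instead of present with NA. days_expected is a hypothetical name, set to 4 only because the sample has four days per year.

```r
library(dplyr)

# Sample data from the question: 2001 has one NA measurement
df <- data.frame(
  year  = rep(c(2000, 2001), each = 4),
  day   = rep(1:4, times = 2),
  value = c(5, 3, 2, 3, 2, NA, 6, 5)
)

# Assumption: a complete year has this many rows (365 or 366 for real data)
days_expected <- 4

complete <- df %>%
  group_by(year) %>%
  # keep a year only if it has no NA and the expected number of rows
  filter(!anyNA(value), n() == days_expected) %>%
  ungroup()

complete
```

Both conditions are evaluated once per group, so a single TRUE or FALSE keeps or drops the whole year.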

Here's a one line solution using base R -
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
Example -
df <- data.frame(year = c(rep(2000, 4), rep(2001, 4)), day = 1:4, value = sample.int(10, 8))
df$value[6] <- NA_integer_
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
# 5 2001 1 8
# 6 2001 2 NA
# 7 2001 3 1
# 8 2001 4 5
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
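Why negating the ave() result works may not be obvious, so here is a small sketch that pulls the one-liner apart, using the question's values instead of random ones. The intermediate names has_na and complete_years are made up for illustration.

```r
df <- data.frame(
  year  = rep(c(2000, 2001), each = 4),
  day   = rep(1:4, times = 2),
  value = c(5L, 3L, 2L, 3L, 2L, NA, 6L, 5L)
)

# ave() runs anyNA() within each year and spreads the result back over
# that year's rows. The TRUE/FALSE is written into a copy of the numeric
# vector `value`, so it comes back as 1/0 rather than as a logical.
has_na <- ave(df$value, df$year, FUN = anyNA)
has_na

# `!` turns 0 into TRUE and 1 into FALSE, so this keeps the NA-free years
complete_years <- df[!has_na, ]
complete_years
```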

In base R you could do:
subset(df,!year %in% year[is.na(value)])
# year day value
# 1 2000 1 8
# 2 2000 2 5
# 3 2000 3 4
# 4 2000 4 1

Related

How to keep only first value from distinct values in one column based on repeated values in other column in R? [duplicate]

The code below should group the data by year and then create two new columns with the first and last value of each year.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
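Calling dplyr::mutate() explicitly did the trick, which suggests another attached package (plyr is a common culprit) was masking mutate():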
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
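To check which package an unqualified mutate() resolves to in a given session, something like the following can help. This is only a diagnostic sketch; the question does not say which packages were attached, so plyr as the masking package is an assumption.

```r
library(dplyr)

# Namespace that the unqualified `mutate` resolves to right now.
# "dplyr" is what the grouped code needs; "plyr" would explain the
# ungrouped first/last values shown in the question.
environmentName(environment(mutate))

# Attached packages that define `mutate`, in search-path order;
# the first entry is the one an unqualified call uses.
find("mutate")
```

If plyr is needed as well, attaching it before dplyr, or keeping the dplyr:: prefix, avoids the clash.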
You can also use the summarise function from dplyr to get the first and last values of each group, then join them back onto the original data:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions or want a future-proof solution, you can just index the columns like you would a list:
d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5

Assigning values from one data set to another dataset based on year

I am curious if anyone knows how to apply the Value.1s for each year of df1 to their corresponding years in df2. This will hopefully create two columns for "Value.1" and "Value.2" inside df2. Obviously, the real dataset is quite large and I would rather not do this the long way. I imagine the code will start with df2 %>% mutate(...), ifelse? case_when? I really appreciate any help!
df1 <- data.frame(Year = c(2000:2002),
Value.1 = c(0:2))
Year Value.1
1 2000 0
2 2001 1
3 2002 2
df2 <- data.frame(Year = c(2000,2000,2000,2001,2001,2001,2002,2002,2002),
Value.2 = c(1:9))
Year Value.2
1 2000 1
2 2000 2
3 2000 3
4 2001 4
5 2001 5
6 2001 6
7 2002 7
8 2002 8
9 2002 9
You can do:
df2 <- df2 %>%
mutate(
Value.1 = recode(Year,
!!!setNames(unique(df1$Value.1),
unique(df1$Year))))
> df2
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
"Joins" are your friend when uniting dataframes. They're easy to understand. Try this:
df3 <- dplyr::left_join(df2, df1, by = "Year")
Year Value.2 Value.1
1 2000 1 0
2 2000 2 0
3 2000 3 0
4 2001 4 1
5 2001 5 1
6 2001 6 1
7 2002 7 2
8 2002 8 2
9 2002 9 2
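For a single lookup column, the same result can be sketched in base R with match(). This is an alternative to the join, not what either answer above uses, and idx is a name introduced here for illustration.

```r
df1 <- data.frame(Year = 2000:2002, Value.1 = 0:2)
df2 <- data.frame(Year = rep(c(2000, 2001, 2002), each = 3),
                  Value.2 = 1:9)

# For each row of df2, the position of its Year in df1 (NA if absent)
idx <- match(df2$Year, df1$Year)

# Pull Value.1 from those positions; unmatched years would become NA
df2$Value.1 <- df1$Value.1[idx]
df2
```

A join scales better once several columns have to be carried over, since match() has to be indexed once per column.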

Change the form while merging multiple data frames

I have several data frames that are all in same format, like:
price <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
size <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
performance <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
> price
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> size
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> performance
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
I want to merge these data frames, but with the result in a different form. The desired output looks like:
> df
name Year price size performance
1 A 2001 1 1 1
2 A 2002 2 2 2
3 A 2003 3 3 3
4 B 2001 2 2 2
5 B 2002 3 3 3
6 B 2003 4 4 4
7 C 2001 4 4 4
8 C 2002 5 5 5
9 C 2003 6 6 6
which arranges the data by name and then by date. Since I have over 2000 names and 180 dates in each of the 20 data frames, it's too difficult to sort it by typing in each specific name.
You need to convert your data frames to long format, then join them together:
library(tidyverse)
price_long <- price %>% gather(key, value = "price", -Year)
size_long <- size %>% gather(key, value = "size", -Year)
performance_long <- performance %>% gather(key, value = "performance", -Year)
price_long %>%
left_join(size_long) %>%
left_join(performance_long)
Joining, by = c("Year", "key")
Joining, by = c("Year", "key")
Year key price size performance
1 2001 A 1 1 1
2 2002 A 2 2 2
3 2003 A 3 3 3
4 2001 B 2 2 2
5 2002 B 3 3 3
6 2003 B 4 4 4
7 2001 C 4 4 4
8 2002 C 5 5 5
9 2003 C 6 6 6
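Since gather() has been superseded by pivot_longer() in tidyr, the same long-then-join idea can be sketched as below, written so that it extends to a longer list of data frames. The names frames, long and result are made up here, and size and performance are copies of price only because the question's sample frames are identical.

```r
library(dplyr)
library(tidyr)

price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price
performance <- price

frames <- list(price = price, size = size, performance = performance)

# Reshape each frame to long form; the value column takes the frame's name
long <- Map(
  function(d, nm) pivot_longer(d, -Year, names_to = "name", values_to = nm),
  frames, names(frames)
)

# Join all long frames on the shared keys, then order by name and year
result <- Reduce(function(x, y) left_join(x, y, by = c("Year", "name")), long) %>%
  arrange(name, Year) %>%
  select(name, Year, everything())

result
```

Giving by explicitly also avoids the "Joining, by = ..." messages.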
You can use data.table:
library(data.table)
a=list(price=price,size=size,performance=performance)
dcast(melt(rbindlist(a,T,idcol = "name"),1:2),variable+Year~name)
variable Year performance price size
1: A 2001 1 1 1
2: A 2002 2 2 2
3: A 2003 3 3 3
4: B 2001 2 2 2
5: B 2002 3 3 3
6: B 2003 4 4 4
7: C 2001 4 4 4
8: C 2002 5 5 5
9: C 2003 6 6 6
We can combine the data frames, gather and spread the combined data frame.
library(tidyverse)
dat <- list(price, size, performance) %>%
setNames(c("price", "size", "performance")) %>%
bind_rows(.id = "type") %>%
gather(name, value, A:C) %>%
spread(type, value) %>%
arrange(name, Year)
dat
# Year name performance price size
# 1 2001 A 1 1 1
# 2 2002 A 2 2 2
# 3 2003 A 3 3 3
# 4 2001 B 2 2 2
# 5 2002 B 3 3 3
# 6 2003 B 4 4 4
# 7 2001 C 4 4 4
# 8 2002 C 5 5 5
# 9 2003 C 6 6 6
dplyr::bind_rows comes in quite handy in such scenarios. A solution can be:
library(tidyverse)
bind_rows(list(price = price, size = size, performance = performance), .id="Type") %>%
gather(Key, Value, - Type, -Year) %>%
spread(Type, Value)
# Year Key performance price size
# 1 2001 A 1 1 1
# 2 2001 B 2 2 2
# 3 2001 C 4 4 4
# 4 2002 A 2 2 2
# 5 2002 B 3 3 3
# 6 2002 C 5 5 5
# 7 2003 A 3 3 3
# 8 2003 B 4 4 4
# 9 2003 C 6 6 6
The above solution is very similar to the one by @www. It just avoids the use of setNames.
To round it out, here's a package-free base R answer.
# gather the data.frames into a list
myList <- mget(ls())
Note that the three data.frames are the only objects in my environment.
# get the final data.frame
Reduce(merge,
Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
myList, names(myList)))
This returns
Year ID performance price size
1 2001 A 1 1 1
2 2001 B 2 2 2
3 2001 C 4 4 4
4 2002 A 2 2 2
5 2002 B 3 3 3
6 2002 C 5 5 5
7 2003 A 3 3 3
8 2003 B 4 4 4
9 2003 C 6 6 6
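The step inside Map() is the dense part, so here is a sketch of what it does to a single data frame. It assumes the same price data as in the question; long is a name introduced for illustration.

```r
price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))

# stack() turns the A/B/C columns into two columns: `values` and `ind`.
# cbind() then recycles the 3-row Year column to line up with the 9 rows.
long <- cbind(price[1], stack(price[-1]))

# Rename so that the value column carries the data frame's name
names(long) <- c("Year", "price", "ID")
long
```

Reduce(merge, ...) then merges the per-frame results on their common columns, Year and ID.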

Sum rows by interval Dataframe

I need help in a research project problem.
The code problem is: I have a big data frame called FRAMETRUE, and I need to sum certain columns row by row into a new column that I will call Group1.
For example:
head(FRAMETRUE)
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
I must sum the values in the rows from 1989 to 1995 in a new column called Group1. like the column Group1 should be
Group1
22
23
and so on...
I know it must be something simple, I just don't get it, I'm still learning R
If you are looking for an R solution, here's one way to do it. The trick is using [ combined with rowSums: columns 2:8 are the years 1989 to 1995, and na.rm = TRUE makes each NA count as 0.
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, 2:8], na.rm = TRUE)
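If the column positions might shift, the same rowSums call can select the years by name. The sketch below rebuilds the question's sample and assumes the columns are literally named "1989", "1990" and so on (hence check.names = FALSE); data read from a file often gets names like X1989 instead, in which case the years vector would need adjusting.

```r
FRAMETRUE <- data.frame(
  Municipalities = c("A", "B", "C", "D", "E"),
  `1989` = c(3, 7, 10, 7, 5),
  `1990` = c(3, 1, 15, 0, 1),
  `1991` = c(5, 2, 1, NA, 2),
  `1992` = c(2, 4, 3, 5, 4),
  `1993` = c(3, 5, 2, 3, 0),
  `1994` = c(4, 0, NA, 6, 3),
  `1995` = c(2, 4, 2, 4, 5),
  `1996` = c(5, 8, 5, 5, 4),
  `1997` = c(3, 9, 3, 5, 2),
  check.names = FALSE  # keep "1989" etc. as column names
)

# Select the interval by column name rather than by position
years <- as.character(1989:1995)
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, years], na.rm = TRUE)
FRAMETRUE[, c("Municipalities", "Group1")]
```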
A dplyr solution that allows you to refer to your columns by their names:
library(dplyr)
municipalities <- LETTERS[1:4]
year1989 <- sample(4)
year1990 <- sample(4)
year1991 <- sample(4)
df <- data.frame(municipalities,year1989,year1990,year1991)
# df
municipalities year1989 year1990 year1991
1 A 4 2 2
2 B 3 1 3
3 C 1 3 4
4 D 2 4 1
# Calculate row sums here
df <- mutate(df, Group1 = rowSums(select(df, year1989:year1991)))
# df
municipalities year1989 year1990 year1991 Group1
1 A 4 2 2 8
2 B 3 1 3 7
3 C 1 3 4 8
4 D 2 4 1 7

merge data frames in R in two dimensions

DATA FRAME 1: HOUSE PRICE
year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
DATA FRAME 2: MORTGAGE INFO
ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
OUTCOME DESIRED:
ID MSA YEAR MONTH HOUSE_PRICE
1 MSA1 2000 2 1
2 MSA3 2001 3 7
3 MSA2 2001 3 5
Does anyone know how to achieve this efficiently? Data frame 2 is huge and data frame 1 is a manageable size. Thanks!
Assuming both are data.tables dt1 and dt2, this can be done without having to cast them to long form as follows:
require(data.table)
dt2[dt1, .(ID, MSA, House_price = get(MSA)), by=.EACHI,
nomatch=0L, on=c(YEAR="year", MONTH="month")]
# YEAR MONTH ID MSA House_price
# 1: 2000 1 4 MSA1 12
# 2: 2000 2 1 MSA1 1
# 3: 2001 3 2 MSA3 7
# 4: 2001 3 3 MSA2 5
The data used above:
dt1 = fread('year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
')
dt2 = fread('ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
')
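The same no-reshape lookup can also be sketched in base R by indexing a price matrix with (row, column) pairs. This is an alternative technique, not a translation of the data.table call above, and df1, df2, prices, row_idx and col_idx are names chosen here for illustration.

```r
df1 <- data.frame(
  year  = c(2000, 2000, 2001),
  month = c(1, 2, 3),
  MSA1  = c(12, 1, 9),
  MSA2  = c(6, 3, 5),
  MSA3  = c(7, 4, 7)
)
df2 <- data.frame(
  ID    = 1:5,
  MSA   = c("MSA1", "MSA3", "MSA2", "MSA1", "MSA3"),
  YEAR  = c(2000, 2001, 2001, 2000, 2000),
  MONTH = c(2, 3, 3, 1, 3)
)

# Price columns only, as a numeric matrix
prices <- as.matrix(df1[, -(1:2)])

# Row of df1 matching each mortgage's year/month (NA when there is none)
row_idx <- match(paste(df2$YEAR, df2$MONTH), paste(df1$year, df1$month))
# Column of the price matrix named by each mortgage's MSA
col_idx <- match(df2$MSA, colnames(prices))

# One (row, column) pair per mortgage; a pair containing NA yields NA
df2$HOUSE_PRICE <- prices[cbind(row_idx, col_idx)]

# Drop mortgages with no matching year/month, if only matches are wanted
df2[!is.na(df2$HOUSE_PRICE), ]
```

Only the small price table is converted to a matrix; the large table is touched by two match() calls and one indexing step.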
This looks like a case of turning a data frame from wide to long form and then merging two data frames. Here is a dplyr solution with gather and right_join. The name change is just here to make the join easier.
library(dplyr)
library(tidyr)
names(df1) <- toupper(names(df1))
gather(df1,MSA,HOUSE_PRICE,-YEAR,-MONTH) %>%
right_join(df2,by = c("YEAR","MONTH","MSA"))
output
YEAR MONTH MSA HOUSE_PRICE ID
1 2000 2 MSA1 1 1
2 2001 3 MSA3 7 2
3 2001 3 MSA2 5 3
4 2000 1 MSA1 12 4
5 2000 3 MSA3 NA 5
