Remove rows conditionally in a dataframe [duplicate] - r

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I have a dataframe and I would like to remove duplicate rows, keeping the one with the maximum value in each group.
Here is a simplified example of my dataframe:
Code Weight Year
1 27009 289 1975
2 27009 300 1975
3 27009 376 1977
4 30010 259 1975
5 30010 501 1979
6 30010 398 1979
[....]
My output should be:
Code Weight Year
1 27009 300 1975
2 27009 376 1977
3 30010 259 1975
4 30010 501 1979
[....]
Between Code and Weight I have got 5 more columns with different values and between Weight and Year one more column with still different values.
Should I use an if statement?

You could use the dplyr package:
df <- read.table(text = "Code Weight Year
27009 289 1975
27009 300 1975
27009 376 1977
30010 259 1975
30010 501 1979
30010 398 1979", header = TRUE)
library(dplyr)
df$x <- rnorm(6)
df %>%
  group_by(Year, Code) %>%
  slice(which.max(Weight))
# Code Weight Year x
# (int) (int) (int) (dbl)
# 1 27009 300 1975 1.3696332
# 2 30010 259 1975 1.1095553
# 3 27009 376 1977 -1.0672932
# 4 30010 501 1979 0.1152063
As a second solution you could use the data.table package.
setDT(df)
df[order(-Weight), head(.SD, 1), keyby = .(Year, Code)]
The results are the same.

Simply run aggregate in base R using Code and Year as the grouping. This will take max values of all other numeric columns (note the maxima are computed per column, so in general they may come from different rows):
finaldf <- aggregate(. ~ Code + Year, df, FUN = max)
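A minimal self-contained run on the sample data from the question (no extra columns here, so only Weight is collapsed):

```r
df <- read.table(text = "Code Weight Year
27009 289 1975
27009 300 1975
27009 376 1977
30010 259 1975
30010 501 1979
30010 398 1979", header = TRUE)

# aggregate() applies max to every column not named in the grouping formula
finaldf <- aggregate(. ~ Code + Year, df, FUN = max)
finaldf  # one row per Code/Year pair, carrying the maximum Weight
```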

Related

Create a table out of a tibble

I do have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise_at(vars(SKILL_1),
               list(name = sum))
Logically, I obtained 348 different rows as a tibble. Now I am struggling to create a new table with these values. My new table should look similar to my tibble. How can I do that? Is there even a way? I've already tried to read in an Excel file with a date range from 01/1992 - 01/2021 in order to obtain exactly 349 rows and then merge it with the rows of the tibble, but it did not work.
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date, end_date) {
  ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
  data.frame(start = pmax(start_date, as.Date(ym)),
             end = pmin(end_date, as.Date(ym, frac = 1)),
             month = month.name[cycle(ym)],
             year = as.integer(ym),
             stringsAsFactors = FALSE)
}
Once you create the function, you can specify the start and end date you want (as Date objects; a bare 1991-01-01 would be evaluated as the arithmetic 1991 - 1 - 1):
left_table <- createmonthyear(as.Date("1991-01-01"), as.Date("2021-01-01"))
then left join the output with what you have (the join columns must match in name and type, so convert the numeric MONTH to month names first):
library(dplyr)
right_table <- data.frame(cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise_at(vars(SKILL_1),
               list(name = sum)))
right_table$MONTH <- month.name[right_table$MONTH]
results <- left_join(left_table, right_table, by = c("year" = "YEAR", "month" = "MONTH"))
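As an aside, tidyr can fill in the missing year-month combinations directly, without building a calendar table first. A minimal sketch on toy data standing in for cps_data (the data frame and its values here are hypothetical, and YEAR/MONTH are assumed numeric):

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for cps_data
cps_data <- data.frame(YEAR    = c(1992, 1992, 1993),
                       MONTH   = c(1, 1, 3),
                       SKILL_1 = c(0, 1, 1))

# complete() expands to the full YEAR x MONTH grid,
# filling months with no observations with 0
result <- cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise(name = sum(SKILL_1), .groups = "drop") %>%
  complete(YEAR = 1992:1993, MONTH = 1:12, fill = list(name = 0))
```

Trim the grid afterwards if the range should stop at a specific month, such as 01/2021.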

Alter variable to lag by year

I have a data set I need to test for autocorrelation in a variable.
To do this, I want to first lag it by one period, to test that autocorrelation.
However, as the data is on US elections, the data is only available in two-year intervals, i.e. 1968, 1970, 1972, etc.
As far as I know, I'll need to somehow alter the year variable so that it can run annually in some way so that I can lag the variable of interest by one period/year.
I assume that dplyr is helpful in some way, but I am not sure how.
Yes, dplyr has a helpful lag function that works well in these cases. Since you didn't provide sample data or the specific test that you want to perform, here is a simple example showing an approach you might take:
> df <- data.frame(year = seq(1968, 1978, 2), votes = sample(1000, 6))
> df
year votes
1 1968 565
2 1970 703
3 1972 761
4 1974 108
5 1976 107
6 1978 449
> dplyr::mutate(df, vote_diff = votes - dplyr::lag(votes))
year votes vote_diff
1 1968 565 NA
2 1970 703 138
3 1972 761 58
4 1974 108 -653
5 1976 107 -1
6 1978 449 342
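Since the end goal is an autocorrelation check, the lagged series can feed straight into cor(). A minimal sketch along the same lines (the seed is arbitrary, used only for reproducibility; use = "complete.obs" drops the leading NA pair):

```r
library(dplyr)

set.seed(42)  # hypothetical seed, for reproducibility only
df <- data.frame(year = seq(1968, 1978, 2), votes = sample(1000, 6))

# lag-1 autocorrelation of the votes series
r <- cor(df$votes, lag(df$votes), use = "complete.obs")
```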

Calculating age per animal by subtracting years in R

I am looking to calculate relative age of animals. I need to subtract sequentially each year from the next for each animal in my dataset. Because an animal can have multiple reproductive events in a year, I need the age for the remaining events in that year (i.e. all events after the first) to be the same as the initial calculation.
Update:
The dataset more resembles this:
Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1
And I need it to look like this
Year ID Age
1 1975 6 0
2 1975 6 0
3 1976 6 1
4 1977 6 2
6 1975 9 0
8 1978 9 3
Apologies for the initial confusion, if I wasn't clear on what I needed to accomplish.
Any help would be greatly appreciated.
Things done "by group" are usually easiest to do using dplyr or data.table
library(dplyr)
your_data %>%
  group_by(ID) %>%                 # group by ID
  mutate(Age = Year - min(Year))   # add new column
or
library(data.table)
setDT(your_data) # convert to data table
# add new column by group
your_data[, Age := Year - min(Year), by = ID]
In base R, ave is probably easiest for adding a groupwise column to existing data:
your_data$Age = with(your_data, ave(Year, ID, function(x) x - min(x)))
but the syntax isn't as nice as the options above.
You can test on this data:
your_data = read.table(text = " Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1 ", header = T)
if you're trying to figure out the relative age based on one initial birth year, 1975 (which it seems like you are), then you can just make a new column called "RelativeAge" and set it equal to the year minus 1975
data$RelativeAge = data$Year - 1975
then just get rid of the original "Age" column, or rename as necessary

How to find correlation in a data set

I wish to find the correlation between trip duration and age from the data set below. I am applying the function cor(age, df$tripduration). However, it gives me NA as output. Could you please let me know how to compute the correlation? I derived "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and trip duration (in seconds) as df$tripduration.
Below is the data. The number 1 in gender means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
>[1] 0.08366848
# To check the coefficient against birth year directly
cor(df$tripduration, df$birth.year)
>[1] -0.08366848
By the way, please format the question with easily replicable data that people can copy and paste straight into R. This actually helps you in finding an answer.
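One convenient way to share replicable data (an aside, using only base R) is dput(), which prints an expression that rebuilds the object:

```r
# a few rows of the question's data, enough to illustrate
df <- data.frame(tripduration = c(439, 186, 442),
                 birth.year   = c(1980, 1984, 1969))

# prints R code that recreates df, ready to paste into a question
dput(df)
```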
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
>[1] 0.1726607
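Alternatively (an aside, not from the original answer), cor() can skip incomplete pairs itself via its use argument, without subsetting first:

```r
# toy data with missing values, standing in for df
df <- data.frame(tripduration = c(439, 186, NA, 170, 494),
                 birth.year   = c(1980, NA, 1969, 1986, 1984))
age <- 2017 - as.numeric(df$birth.year)

# "complete.obs" keeps only the rows where both values are present
r <- cor(age, df$tripduration, use = "complete.obs")
```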

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
which must have an added column named "rate", with values coming from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var = 'Year'),
             by.x = c('Country', 'Year'), by.y = c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country, Rate, -Year) %>%
  right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or as #DavidArenburg mentioned in the comments, this can be also done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset and join on 'Year', and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable.name = "Country", value.name = "Rate"),
         on = c("Country", "Year"),
         nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names)
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
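As one more aside (a base R sketch, not from the original answers), the same row/column lookup can be done with match() and matrix indexing, with no reshaping at all:

```r
A <- data.frame(Country = c("USA", "Canada", "Canada"),
                Year = c(1994, 1997, 1998),
                Number = c(455, 342, 987))
B <- data.frame(Year = 1993:1998,
                USA = c(21, 41, 56, 85, 65, 1),
                Canada = c(654, 321, 789, 123, 456, 999))

# pick the row of B by Year and the column of B by Country, element-wise
A$Rate <- B[cbind(match(A$Year, B$Year), match(A$Country, names(B)))]
A
#   Country Year Number Rate
# 1     USA 1994    455   41
# 2  Canada 1997    342  456
# 3  Canada 1998    987  999
```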
