Looping through and comparing values in 2 dataframes - r

I have 2 dataframes that I need to loop through.
Df1[1:5,]
year month Vol
1 2015 7 4.82e-05
2 2015 6 5.91e-05
3 2015 5 6.56e-05
4 2015 4 6.10e-05
5 2015 3 7.85e-05
Df2[1:5,]
year month IB
1 2015 7 0
2 2015 4 1
3 2015 3 0
4 2015 6 1
5 2015 5 0
I need to loop through DF1, compare the months from DF1 and DF2, and where they match set DF1$IB to the corresponding DF2$IB. I tried using sapply, but I get this warning:
tmp<-sapply(DF1$month,function(x){if(DF2$month==x){
DF1$IB<-DF2$IB
}})
Warning messages:
1: In if (DF2$month == x) { :
the condition has length > 1 and only the first element will be used
.....
Any help would be greatly appreciated. Otherwise I would have to resort to multiple for loops, and since DF1 is 900K rows long and DF2 is 300 rows long, that seems very inefficient to me.

With the latest version of data.table (v1.9.5, installed from GitHub) you don't need to set keys; setDT(df1)[df2, on = c("year","month")] is enough to add the IB column. This gives:
year month Vol IB
1: 2015 7 4.82e-05 0
2: 2015 4 6.10e-05 1
3: 2015 3 7.85e-05 0
4: 2015 6 5.91e-05 1
5: 2015 5 6.56e-05 0
If the year/month combinations are not the same in both datasets, you have to join the other way around so that every row of df1 is kept:
setDT(df2)[df1, on = c("year","month")]
which gives:
year month IB Vol
1: 2015 7 0 4.82e-05
2: 2015 6 1 5.91e-05
3: 2015 5 0 6.56e-05
4: 2015 4 1 6.10e-05
5: 2015 3 NA 7.85e-05
Used data for second example:
df1 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 6L, 5L, 4L, 3L), Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.1e-05, 7.85e-05)), .Names = c("year", "month", "Vol"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
df2 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 4L, 2L, 6L, 5L), IB = c(0L, 1L, 0L, 1L, 0L)), .Names = c("year", "month", "IB"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

If your Df1 is that large, data.table might be a better choice than merge.
library(data.table)
setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB]
Df1
# year month Vol IB
# 1: 2015 3 7.85e-05 0
# 2: 2015 4 6.10e-05 1
# 3: 2015 5 6.56e-05 0
# 4: 2015 6 5.91e-05 1
# 5: 2015 7 4.82e-05 0
So this converts Df1 to a data.table and indexes it on year and month, then does a data.table join on Df2 (also converted to a data.table), then adds the IB column from Df2 to Df1.
Using a more realistic example:
set.seed(1)
Df1 <- data.frame(year=rep(2015,1e6),
month=sample(3:7,1e6,replace=TRUE),
Vol=rnorm(1e6))
system.time(result.mrg <- merge(Df1,Df2,by=c("year","month")))
# user system elapsed
# 11.8 0.0 11.8
system.time(result.dt <- setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB])
# user system elapsed
# 0.07 0.00 0.06
identical(result.mrg$IB, result.dt$IB)
# [1] TRUE
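The warning in the question arises because if() expects a single TRUE/FALSE value, while DF2$month == x is a whole vector. For readers without data.table, the same lookup can be sketched in base R with match(), which compares all rows at once. This is only a minimal sketch, not a benchmarked alternative: it assumes each year/month pair appears at most once in Df2, and the key1/key2 names are made up for illustration. It uses the small second example data, where month 3 has no partner in Df2.

```r
# Small example data (same values as the second example above)
Df1 <- data.frame(year = 2015L,
                  month = c(7L, 6L, 5L, 4L, 3L),
                  Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.1e-05, 7.85e-05))
Df2 <- data.frame(year = 2015L,
                  month = c(7L, 4L, 2L, 6L, 5L),
                  IB = c(0L, 1L, 0L, 1L, 0L))

# Build one lookup key per row from year and month (hypothetical helper names)
key1 <- paste(Df1$year, Df1$month)
key2 <- paste(Df2$year, Df2$month)

# match() returns, for each row of Df1, the position of its key in Df2
# (NA when there is no match), so indexing IB with it does the lookup
Df1$IB <- Df2$IB[match(key1, key2)]
Df1
```

Unlike the keyed join, this keeps Df1 in its original row order, and rows without a match get NA.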

Related

How to create dummy that equals 1 (and 0 otherwise) if an id appears only once? (in R)

df <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L))
id hire_year
1 1 2017
2 1 2017
3 2 2017
4 3 2017
5 3 2016
6 4 2016
**Expected output**
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
How to create dummy that equals 1 (and 0 otherwise) if an id appears only once?
With tidyverse, we can group by the id, then use the number of observations within an ifelse statement.
library(tidyverse)
df %>%
group_by(id) %>%
mutate(dummy = ifelse(n() == 1, 1, 0))
Or we could add the number of observations, then change the value based on the condition.
df %>%
add_count(id, name = "dummy") %>%
mutate(dummy = ifelse(dummy == 1, 1, 0))
Output
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L)
) %>%
add_count(id, name = 'dummy') %>%
mutate(
dummy = as.integer(dummy == 1)
)
#> id hire_year dummy
#> 1 1 2017 0
#> 2 1 2017 0
#> 3 2 2017 1
#> 4 3 2017 0
#> 5 3 2016 0
#> 6 4 2016 1
Created on 2022-03-04 by the reprex package (v2.0.0)
We can use ave in base R like below
> transform(df, dummy = +(ave(id, id, FUN = length) == 1))
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
A data.table solution:
library(data.table)
DT <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L))
# Convert into data.table
setDT(DT)
# Count number of times "id" shows up
DT[, count := .N, by =.(id)]
# Create a dummy variable that equals 1 if count ==1
DT[, dummy := fifelse(count == 1,1,0)]
id hire_year count dummy
<int> <int> <int> <num>
1: 1 2017 2 0
2: 1 2017 2 0
3: 2 2017 1 1
4: 3 2017 2 0
5: 3 2016 2 0
6: 4 2016 1 1
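Another base R option, sketched here without any counting step, is to flag every id that has a duplicate anywhere in the column. The `repeated` helper name is just for illustration; the data are the same six rows as in the question.

```r
df <- data.frame(id = c(1L, 1L, 2L, 3L, 3L, 4L),
                 hire_year = c(2017L, 2017L, 2017L, 2017L, 2016L, 2016L))

# duplicated() only flags the second and later occurrences, so scan from
# both ends to flag every row whose id occurs more than once
repeated <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)

# dummy is 1 when the id appears exactly once, 0 otherwise
df$dummy <- as.integer(!repeated)
df
```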

In R: How to tell R that a value should be inserted into a categorical column while applying two conditions

What we have:
companyID year status
1 2010
1 2011
1 2012 2
1 2013
1 2014
2 2007
2 2008
2 2009 2
2 2010
2 2011
2 2012 1
2 2013
For companyID 1: I have the observation with status 2 in year 2012. I would want R to make any observations prior to that as status 1 (by companyID). Then I would want R to make observations after that (the status 2 in 2012) to a status of 2 (still per company).
For companyID 2: I have the observation with status 2 in year 2009. I would want R to make any observations prior to that as status 1 (by companyID). Then I would want R to make observations to status 2 until a status 1 shows up again (still per company).
(Summing up: Fill in the other value (1) before the one that is already there (2), then continue with 2 until there is another change (change as in: either that there is a new company or that there was a status change that had already been stated in the original dataframe))
This would then look like the following, and is what we want to achieve:
companyID year status
1 2010 1
1 2011 1
1 2012 2
1 2013 2
1 2014 2
2 2007 1
2 2008 1
2 2009 2
2 2010 2
2 2011 2
2 2012 1
2 2013 1
We have a large dataset, which is why this would not be possible manually. Is there a way to code this for both companyIDs simultaneously (and hence for all the thousands of observations we have) in R?
Here is one way :
library(dplyr)
library(tidyr)
df %>%
group_by(companyID) %>%
fill(status) %>%
mutate(status = replace(status, is.na(status),
ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
ungroup
# companyID year status
# <int> <int> <dbl>
# 1 1 2010 1
# 2 1 2011 1
# 3 1 2012 2
# 4 1 2013 2
# 5 1 2014 2
# 6 2 2007 1
# 7 2 2008 1
# 8 2 2009 2
# 9 2 2010 2
#10 2 2011 2
#11 2 2012 1
#12 2 2013 1
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), year = c(2010L, 2011L, 2012L, 2013L, 2014L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L), status = c(NA,
NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)),
class = "data.frame", row.names = c(NA, -12L))
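The logic of that answer (carry the last known status forward, then give the leading gap the opposite of the first known status) can also be sketched in base R. This is only an illustration under two assumptions: rows are already sorted by year within each company, and status only takes the values 1 and 2. `fill_status` is a hypothetical helper name, not part of any package.

```r
df <- data.frame(
  companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  year = c(2010L, 2011L, 2012L, 2013L, 2014L,
           2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L),
  status = c(NA, NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)
)

# Hypothetical helper: fill one company's status vector
fill_status <- function(s) {
  known <- which(!is.na(s))
  if (length(known) == 0L) return(s)   # nothing to fill from
  # idx counts how many known values have been seen so far (0 = none yet)
  idx <- cumsum(!is.na(s))
  # carry the last known value forward; positions before the first stay NA
  filled <- c(NA, s[known])[idx + 1L]
  # leading gap gets the opposite of the first known status
  filled[is.na(filled)] <- if (s[known[1L]] == 1L) 2L else 1L
  filled
}

# ave() applies the helper within each companyID and keeps row order
df$status <- ave(df$status, df$companyID, FUN = fill_status)
df
```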

Merge unequal dataframes by matching two rows replace with 0 the missing values in R

I would like to create a new data frame by merging two unequal data frames by matching two columns and replace with 0 the missing values.
These are two examples of the data frames I have:
df1
ID YEAR_INTERVIEW ID_HOUSEHOLD
1 2017 300
1 2018 300
1 2019 300
2 2017 150
2 2018 150
2 2019 150
3 2017 420
3 2018 420
df2
ID YEAR_INTERVIEW YEARS_EDU
1 2017 10
1 2018 10
1 2019 10
3 2017 3
3 2018 3
*note that in the second data frame I don't have information for individual 2
I would like to get the following data frame:
df3
ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
1 2017 300 10
1 2018 300 10
1 2019 300 10
2 2017 150 0
2 2018 150 0
2 2019 150 0
3 2017 420 3
3 2018 420 3
I am trying:
df3<-merge(df1,df2, by="ID", all=TRUE)
df3<-merge(df1,df2, by="ID","YEAR_INTERVIEW", all=TRUE)
The first option replicates hundreds of ID observations with years of interviews while the second gives me 0 values.
Any help would be much appreciated :) THANK YOU
The by argument needs to be a single vector, which we can create with c(). Also, all = TRUE is a full join, but here it should be a left join, so use all.x = TRUE. Where there is no match, the element will be NA by default.
out <- merge(df1,df2, by=c("ID","YEAR_INTERVIEW"), all.x=TRUE)
The NAs can be converted to 0
out$YEARS_EDU[is.na(out$YEARS_EDU)] <- 0
-output
out
# ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
#1 1 2017 300 10
#2 1 2018 300 10
#3 1 2019 300 10
#4 2 2017 150 0
#5 2 2018 150 0
#6 2 2019 150 0
#7 3 2017 420 3
#8 3 2018 420 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L), ID_HOUSEHOLD = c(300L,
300L, 300L, 150L, 150L, 150L, 420L, 420L)), class = "data.frame",
row.names = c(NA,
-8L))
df2 <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L), YEARS_EDU = c(10L, 10L, 10L, 3L,
3L)), class = "data.frame", row.names = c(NA, -5L))
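To see why the first attempt in the question replicates observations, it can help to compare row counts. The sketch below rebuilds the same example data with data.frame() and merges it both ways; the `by_id` and `by_both` names are only for illustration.

```r
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L),
  ID_HOUSEHOLD = c(300L, 300L, 300L, 150L, 150L, 150L, 420L, 420L)
)
df2 <- data.frame(
  ID = c(1L, 1L, 1L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L),
  YEARS_EDU = c(10L, 10L, 10L, 3L, 3L)
)

# Matching on ID alone pairs every df1 row of an ID with every df2 row of
# that ID, so the result has more rows than df1
by_id <- merge(df1, df2, by = "ID", all = TRUE)

# Matching on both columns as a left join keeps exactly one row per df1 row
by_both <- merge(df1, df2, by = c("ID", "YEAR_INTERVIEW"), all.x = TRUE)
by_both$YEARS_EDU[is.na(by_both$YEARS_EDU)] <- 0L

nrow(df1)
nrow(by_id)
nrow(by_both)
```

Checking that nrow(by_both) equals nrow(df1) is a cheap sanity test for a left join on keys that are unique in df2.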

transposing rows into multiple columns in R

I have following data frame df
ID Year Var value
1 2011 x1 1.2
1 2011 x2 2
1 2012 x1 1.5
1 2012 x2 2.3
3 2013 x1 3
3 2014 x1 4
4 2015 x1 5
5 2016 x1 6
4 2016 x1 2
I want to transform the data in following format
ID Year x1 x2
1 2011 1.2 2
1 2011 2 NA
1 2012 1.5 2.3
3 2013 3 NA
3 2014 4 4
4 2015 5 NA
4 2016 2 NA
5 2016 6 NA
Please help
Using the tidyr library, I believe this is what you are looking for:
library(tidyr)
df <- data.frame(stringsAsFactors=FALSE,
ID = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
Year = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
Var = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2)
)
df2 <- df %>%
spread(Var, value)
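If tidyr is not available, a similar long-to-wide reshape can be sketched with base R's reshape(). This is an alternative to the spread() call above, not what the answer used; the `wide` name is only for illustration, and the row order is sorted afterwards to make the result easier to compare.

```r
df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 3L, 3L, 4L, 5L, 4L),
  Year = c(2011L, 2011L, 2012L, 2012L, 2013L, 2014L, 2015L, 2016L, 2016L),
  Var = c("x1", "x2", "x1", "x2", "x1", "x1", "x1", "x1", "x1"),
  value = c(1.2, 2, 1.5, 2.3, 3, 4, 5, 6, 2),
  stringsAsFactors = FALSE
)

# One output row per ID/Year pair, one output column per level of Var
wide <- reshape(df, idvar = c("ID", "Year"), timevar = "Var",
                direction = "wide")

# reshape() names the new columns value.x1 and value.x2; drop the prefix
names(wide) <- sub("^value\\.", "", names(wide))

# Sort by ID and Year and reset the row names
wide <- wide[order(wide$ID, wide$Year), ]
rownames(wide) <- NULL
wide
```

ID/Year pairs that have no x2 measurement get NA in that column.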

Proper way to split data frames at multiple levels using ddply [duplicate]

Let's say I have a data frame like the following:
year stint ID W
1 2003 1 abc 10
2 2003 2 abc 3
3 2003 1 def 16
4 2004 1 abc 15
5 2004 1 def 11
6 2004 2 def 7
I would like to combine the data so that it looks like
year ID W
1 2003 abc 13
3 2003 def 16
4 2004 abc 15
5 2004 def 18
I found a way to combine the data as desired, but I'm very sure that there's a better way.
combinedData = unique(ddply(data, "ID", function(x) {
ddply(x, "year", function(y) {
data.frame(ID=x$ID, W=sum(y$W))
})
}))
combinedData[order(combinedData$year),]
This produces the following output:
year ID W
1 2003 abc 13
7 2003 def 16
4 2004 abc 15
10 2004 def 18
Specifically I don't like that I had to use unique (otherwise I get each unique combo of year,ID,W three times in the outputted data), and I don't like that the row numbers aren't sequential. How can I do this more cleanly?
Do this with base R:
aggregate(W~year+ID, df, sum)
# year ID W
#1 2003 abc 13
#2 2004 abc 15
#3 2003 def 16
#4 2004 def 18
data
df <- structure(list(year = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L
), stint = c(1L, 2L, 1L, 1L, 1L, 2L), ID = structure(c(1L, 1L,
2L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
W = c(10L, 3L, 16L, 15L, 11L, 7L)), .Names = c("year", "stint",
"ID", "W"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
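To match the layout asked for in the question (sorted by year, with sequential row numbers), the aggregate() result can be reordered and its row names reset. A minimal sketch on the same data, rebuilt here with data.frame() and a plain character ID for brevity; the `res` name is only for illustration:

```r
df <- data.frame(
  year = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L),
  stint = c(1L, 2L, 1L, 1L, 1L, 2L),
  ID = c("abc", "abc", "def", "abc", "def", "def"),
  W = c(10L, 3L, 16L, 15L, 11L, 7L)
)

# Sum W over stints for every year/ID combination
res <- aggregate(W ~ year + ID, data = df, FUN = sum)

# Sort by year, then ID, and renumber the rows 1..n
res <- res[order(res$year, res$ID), ]
rownames(res) <- NULL
res
```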
