Panel Data - sum by group and create new variable - r

I know there are already many questions on "sum by group", but none of them solved my problem. Here it is:
df1 is my simplified data set
> df1 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
df2 is the desired result (see var2):
> df2 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301),
var2= c(130,130,700,700,35,35,350,350,132,132,702,702) )
So I would like to calculate the sums of var1 grouped by ID and the first two digits of category.
If the first two digits of category are 09 (or 10, and so on), var2 should hold the sum of var1 within that ID and two-digit prefix, so that rows with the same ID and the same prefix get the same sum.
I tried to achieve that with:
> df1$var2 = rep(NA, rep(length(df1$ID)))
df1$var2 = ifelse(substr(df1$category,1,2)=="09", by(df1[Year==2009,]$var1, df1[Year==2009,]$ID,sum), df1$var2)
df1$Var2 = ifelse(substr(df1$category,1,2)=="10", by(df1[Year==2010,]$var1, df1[Year==2010,]$ID,sum), df1$var1)
But here the sums are not assigned to the correct item.
Could somebody help me out?

df1 = data.frame( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910",NA,"0911","0913", "0914", "0910","0910",NA,"1014","1012",NA,"1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
I added NA values in OP's original dataframe to reflect the full specification he desired.
df1$category_sub = substr(df1$category, 1, 2)
df1_aggre = aggregate(var1 ~ ID + category_sub, data = df1, sum)
names(df1_aggre)[3] = "var2"
df2 = merge(df1, df1_aggre, all=TRUE)
df2[order(df2$Year),]
Result:
> df2[order(df2$Year),]
ID category_sub Year category var1 var2
1 1621 09 2009 0910 60 60
4 1621 <NA> 2009 <NA> 70 NA
5 1628 09 2009 0911 400 700
6 1628 09 2009 0913 300 700
9 3101 09 2009 0914 15 35
10 3101 09 2009 0910 20 35
11 3105 09 2009 0910 200 200
12 3105 <NA> 2009 <NA> 150 NA
2 1621 10 2010 1014 61 132
3 1621 10 2010 1012 71 132
7 1628 10 2010 1013 301 301
8 1628 <NA> 2010 <NA> 401 NA
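I first extracted the first two characters of category into category_sub and summed var1 by ID and category_sub. I then renamed the aggregated var1 to var2 and merged df1 and df1_aggre by ID and category_sub with the all=TRUE option, which specifies a full outer join. Rows with a missing category get NA in var2, because the formula interface of aggregate drops rows with NA in the grouping variables. The resulting dataframe was unsorted, so I sorted df2 by Year to get the desired result.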
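If the goal is only to add the group sum as a new column, a shorter base R route is ave(), which returns a vector of the same length as its input. This is a minimal sketch on the question's original data, assuming category has no missing values:

```r
# Original example data from the question (no NA values in category)
df1 <- data.frame(
  Year = c(rep(2009, 8), rep(2010, 4)),
  ID = c(1621, 1621, 1628, 1628, 3101, 3101, 3105, 3105, 1621, 1621, 1628, 1628),
  category = c("0910", "0910", "0911", "0913", "0914", "0910",
               "0910", "0911", "1014", "1012", "1011", "1013"),
  var1 = c(60, 70, 400, 300, 15, 20, 200, 150, 61, 71, 401, 301)
)

# Sum var1 within each combination of ID and the two-character category prefix;
# ave() repeats the group sum on every row of the group
df1$var2 <- ave(df1$var1, df1$ID, substr(df1$category, 1, 2), FUN = sum)
df1
```

Because ave() keeps the original row order, no merge or re-sorting step is needed.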

How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
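results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    # keep only the id column and the varA_* / varB_* columns
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  # full join on id, as asked in the question (all = TRUE is not a dplyr join argument)
  reduce(function(x, y)
    full_join(x, y, by = "id"))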
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 France 1 7 34 2 9 95
2 Belgium 2 2 3 53 10 30
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
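With a large number of yearly data frames, typing each name into list(...) does not scale. The sketch below is one possible way around that, assuming the data frames follow the df_<year> naming pattern from the question and live in the current environment. The trimmed-down data is for illustration only. It collects the data frames by name with mget(), then selects and merges as above:

```r
# Toy versions of the yearly data frames (three variables each, for illustration)
id <- c("France", "Belgium", "Spain")
df_2012 <- data.frame(id, varA_2012 = c(1, 2, 3),   varB_2012 = c(7, 2, 9),    varC_2012 = c(1, 56, 0))
df_2013 <- data.frame(id, varA_2013 = c(34, 3, 56), varB_2013 = c(2, 53, 5),   varC_2013 = c(24, 3, 45))
df_2014 <- data.frame(id, varA_2014 = c(9, 10, 5),  varB_2014 = c(95, 30, 75), varC_2014 = c(99, 0, 51))

years <- 2012:2014

# Look the data frames up by name instead of listing them by hand
dfs <- mget(paste0("df_", years))

# Keep id plus the varA_* and varB_* columns of each data frame
keep <- lapply(dfs, function(x) x[, grepl("^(id$|var[AB]_)", names(x)), drop = FALSE])

# Full outer join of all pieces on id
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), keep)
panel_df
```

Only the years vector has to change when more data frames are added.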

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where several observations (incomes) fall within the same range, I would like to take the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for ease of understanding. You can omit it and chain the two pipelines together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
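For small data, the same range join can be sketched in base R without extra packages: merge on id, keep the rows whose income falls inside the band, then take the lowest income per band. The incomes below are hypothetical, hand-picked values (not the random sample from the question) so that the result is reproducible:

```r
income_range <- data.frame(id = "France",
                           inc_lower = c(10, 21, 31, 41, 51, 61, 71),
                           inc_high  = c(20, 30, 40, 50, 60, 70, 80),
                           perct     = 1:7)

# Hypothetical, hand-picked incomes (assumption for illustration)
data_occ <- data.frame(
  id     = c(rep("France", 10), rep("Belgium", 2)),
  income = c(18, 12, 25, 39, 33, 47, 55, 66, 79, 72, 15, 45),
  occ    = c("manager", "clerk", "manual", "skilled", "office",
             "manager", "clerk", "manual", "skilled", "office",
             "clerk", "manual")
)

# 1. All band/observation pairs that share an id
cand <- merge(income_range, data_occ, by = "id")
# 2. Keep only the pairs where the income lies inside the band
cand <- cand[cand$income >= cand$inc_lower & cand$income <= cand$inc_high, ]
# 3. Lowest income first within each band, then keep the first row per band
cand <- cand[order(cand$perct, cand$income), ]
result <- cand[!duplicated(cand$perct), ]
result
```

The intermediate merge builds every band/observation pair per id, so on large data the non-equi join approaches shown above are likely the better choice. Bands with no matching income are dropped in this sketch.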

Convert values using a conversion table R

I am currently running statistical models on ACT and SAT scores. To help clean my data, I want to convert the ACT scores into its SAT equivalent. I found the following table online:
ACT SAT
<dbl> <dbl>
1 36 1590
2 35 1540
3 34 1500
4 33 1460
5 32 1430
6 31 1400
7 30 1370
8 29 1340
9 28 1310
10 27 1280
I want to replace the column ACT_Composite with the number in the SAT column of the conversion table. For instance, if one row displays an ACT_Composite score of 35, I want to input 1540.
If anyone has ideas on how to accomplish this, I would greatly appreciate it.
In base R you can use merge directly:
#Reading score table
df <- read.table(header = TRUE, text ="ACT SAT
36 1590
35 1540
34 1500
33 1460
32 1430
31 1400
30 1370
29 1340
28 1310
27 1280")
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.frame with 50 sample scores
df1 <- data.frame(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
by.x = "ACT_Composite",
by.y = "ACT",
all.x = TRUE,
sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1 31 1400
2 31 1400
3 31 1400
4 31 1400
5 31 1400
6 36 1590
In data.table you can use merge as well:
library(data.table)
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.table with 50 sample scores
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
by.x = "ACT_Composite",
by.y = "ACT",
all.x = TRUE,
sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
Alternatively, in data.table you can also try a join with on:
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
setDT(df)# you need to convert your look-up table df into data.table
result <- df[df1, on = c(ACT = "ACT_Composite")]
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
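Since the conversion table has one row per ACT score, a plain lookup with match() is another option. Unlike merge, it keeps the rows of the score data in their original order. A minimal sketch, using a small hypothetical scores data frame:

```r
# Conversion table from the question
conv <- data.frame(ACT = 36:27,
                   SAT = c(1590, 1540, 1500, 1460, 1430, 1400, 1370, 1340, 1310, 1280))

# Hypothetical scores; 40 is deliberately outside the table
scores <- data.frame(ACT_Composite = c(35, 27, 36, 40))

# match() returns the row of conv for each score, or NA when there is none
scores$SAT_equiv <- conv$SAT[match(scores$ACT_Composite, conv$ACT)]
scores
```

Scores that do not appear in the conversion table come back as NA, which makes gaps in the table easy to spot. To overwrite the original column instead, assign the result to scores$ACT_Composite.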

Divide 2 columns from 2 different dataframes

Does anybody know how to divide two columns from two different dataframes when the rows are identified by several key columns?
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
var1 = c("99", "99",
"99", "99")
value <- seq(1:length(month))
df1 = data.frame(name, month, var1, value)
df2 = df1
df2["var1"] = c("992", "992", "992", "992")
df2["value"] = c(2, 4, 6, 8)
df1
df2
Output
> df1
name month var1 value
1 A oct 2018 99 1
2 A nov 2018 99 2
3 B oct 2018 99 3
4 B nov 2018 99 4
> df2
name month var1 value
1 A oct 2018 992 2
2 A nov 2018 992 4
3 B oct 2018 992 6
4 B nov 2018 992 8
Does anybody know how to create a new dataframe that divides the "value"-column in df2 by the value column of df1? The method should be possible to use also when there are more columns than in the current example.
In base R, we can use merge:
df3 <- merge(df1, df2, by = c("name", "month"))
# value.x comes from df1 and value.y from df2; the question asks for df2 / df1
df3$value <- df3$value.y/df3$value.x
df3
# name month var1.x value.x var1.y value.y value
#1 A nov 2018 99 2 992 4 2
#2 A oct 2018 99 1 992 2 2
#3 B nov 2018 99 4 992 8 2
#4 B oct 2018 99 3 992 6 2
You can drop the value.x and value.y columns if they are not needed.
Join the two data frames together and then perform the division and drop unwanted columns that were generated by the join (assuming you want computed value column to replace the value columns from the original data frames). Depending on what you want you may need a different *_join.
library(dplyr)
df1 %>%
inner_join(df2, by = c("name", "month")) %>%
# value.x comes from df1 and value.y from df2; the question asks for df2 / df1
mutate(value = value.y / value.x) %>%
select(-value.x, -value.y)
giving:
name month var1.x var1.y value
1 A oct 2018 99 992 2
2 A nov 2018 99 992 2
3 B oct 2018 99 992 2
4 B nov 2018 99 992 2
We can use data.table as well: join on 'name' and 'month' and, while joining, overwrite 'value' with the df2 value (i.value) divided by the df1 value.
library(data.table)
df3 <- copy(df1)
setDT(df3)[df2, value := i.value/value, on = .(name, month)]
df3
# name month var1 value
#1: A oct 2018 99 2
#2: A nov 2018 99 2
#3: B oct 2018 99 2
#4: B nov 2018 99 2

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame with hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or in this example, by month) to find out whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation. I have real NAs because this is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put this into a longer script that loops through all my stations and all the years in each station and produces a wind rose if this criterion is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one is quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
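Here is a method using dplyr. It will work even if some hours have no row at all, because the number of valid observations is compared with the number of hours the month should contain.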
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
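To turn any of these summaries into the yes/no test the question asks for, the share of valid values can be compared with the 0.75 threshold directly. A minimal base R sketch, assuming a simplified dummy data frame with only dateTime and WS and a fixed UTC time zone:

```r
set.seed(42)  # arbitrary seed so the dummy data is reproducible
dateTime <- seq(as.POSIXct("2012-01-01", tz = "UTC"),
                as.POSIXct("2012-12-31", tz = "UTC"),
                by = 60 * 60)
WS <- sample(0:20, length(dateTime), replace = TRUE)
WS[WS > 15] <- NA                     # introduce missing values
df <- data.frame(dateTime, WS)

# Grouping key: year and month
df$month <- format(df$dateTime, "%Y-%m")

# Share of non-missing WS values per month (mean of a logical vector)
valid_frac <- tapply(!is.na(df$WS), df$month, mean)

# TRUE for months with at least 75% valid observations
passes <- valid_frac >= 0.75
good_months <- names(passes)[passes]
good_months
```

This version measures the valid share among the rows that are present. If whole hours can be missing from the data, divide by the expected number of hours instead, as in the dplyr answer. Inside a station loop, good_months (or its yearly equivalent, using "%Y" as the format) can serve as the condition for drawing the wind rose.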
