Below is the sample data and one manipulation. The first data set is industry-specific employment. The second data set is overall employment and the unemployment rate. I am trying to do a left join (or at least that is what I think it should be) to get the desired result below. When I do, I hit a one-to-many problem and the row count grows: in this example it goes from 14 to 18, and in the larger data set from 228 to 4348. My main question: can this be done with a properly written join alone, or is there more to it?
area1<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
month<-c(1,2,3,4,5,6,7,8,9,10,11,12,1,2)
emp1 <-c(10,11,12,13,14,15,16,17,20,21,22,24,26,28)
firstset<-data.frame(area1,periodyear,month,emp1)
area2<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear1<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
period<-c(01,02,03,04,05,06,07,08,09,10,11,12,01,02)
rate<-c(3.0,3.2,3.4,3.8,2.5,4.5,6.5,9.1,10.6,5.5,7.8,6.5,4.5,2.9)
emp2<-c(1001,1002,1005,1105,1254,1025,1078,1106,1099,1188,1254,1250,1301,1188)
secondset<-data.frame(area2,periodyear1,period,rate,emp2)
library(dplyr)
secondset <- secondset%>%mutate(month = as.numeric(period))
joined <- left_join(firstset,secondset, by=c("month"))
Desired result (14 rows; the first 3 are shown below):
area1 periodyear month emp1 rate emp2
000000 2020 1 10 3.0 1001
000000 2020 2 11 3.2 1002
000000 2020 3 12 3.4 1005
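The row count grows because month alone is not a unique key: months 1 and 2 occur in both 2020 and 2021, so each of those rows matches twice. We need to add 'periodyear' (and the area column) to the by as well: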
library(dplyr)
left_join(firstset,secondset, by=c("periodyear" = "periodyear1",
"area1" = "area2", "month"))
Output:
area1 periodyear month emp1 period rate emp2
1 0 2020 1 10 1 3.0 1001
2 0 2020 2 11 2 3.2 1002
3 0 2020 3 12 3 3.4 1005
...
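To see why the key matters, here is a minimal sketch in plain Python (not R, and not part of the original answer) that rebuilds the two data sets as tuples and counts the rows a left join produces. The left_join helper and the tuple layout are illustrative assumptions; the constant area column is left out because it does not affect matching.

```python
# Rebuild the sample data as tuples: (year, month, emp1) and (year, month, rate, emp2).
emp1 = [10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 24]
first = [(2020, m, e) for m, e in zip(range(1, 13), emp1)]
first += [(2021, 1, 26), (2021, 2, 28)]

rates = [3.0, 3.2, 3.4, 3.8, 2.5, 4.5, 6.5, 9.1, 10.6, 5.5, 7.8, 6.5, 4.5, 2.9]
emp2 = [1001, 1002, 1005, 1105, 1254, 1025, 1078, 1106, 1099, 1188, 1254, 1250, 1301, 1188]
second = [(y, m, r, e) for (y, m, _), r, e in zip(first, rates, emp2)]


def left_join(left, right, key_left, key_right):
    """Hypothetical helper: one output pair per match; unmatched left rows are kept."""
    out = []
    for l_row in left:
        matches = [r_row for r_row in right if key_right(r_row) == key_left(l_row)]
        if matches:
            out.extend((l_row, r_row) for r_row in matches)
        else:
            out.append((l_row, None))
    return out


# Key on month only: months 1 and 2 exist in both years, so they match twice.
by_month = left_join(first, second, lambda l: l[1], lambda r: r[1])

# Key on (year, month): every left row has exactly one partner.
by_year_month = left_join(first, second, lambda l: (l[0], l[1]), lambda r: (r[0], r[1]))

print(len(by_month), len(by_year_month))  # 18 14
```

Joining on month alone gives 18 rows because months 1 and 2 each appear twice on both sides (2 x 2 matches each); adding the year makes the key unique and brings the count back to 14.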
I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to randomly sample (with replacement) within each group and the number of resampling events must equal the number of samples (i.e., rows) per group. Additionally, the nested groups have multiple columns of data. See the example df below.
I have code using the dplyr package, but am moving away from dplyr as I have to continuously update my code as dplyr changes function names and operations...which is annoying to say the least. Yes...I know there are several ways to circumvent this issue, but have decided it is time to cast aside the dplyr crutches and learn how to execute data wrangling using R base package.
Working dplyr code:
Resample_function = function(Boot)
{group_by(data1, GROUP, YEAR) %>%
slice(sample(n(), replace = TRUE))%>%
ungroup()
}
I have tried to use various combinations of aggregate, ave, and the apply family of functions...but my ability to deal with nested group designs in base package is limited to say the least.
Below I have provided an example data set (df) and what the results should look like. Note that the resampling procedure will produce different results on each run, but the number of resamples per nested group should stay the same.
One final request...I am open to all options (e.g., library(data.table), library(boot), etc) as it would be great if others find this post useful. Additionally, some of these packages can be more efficient than base package. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1.0 1.0
a 2018 2.0 2.0
b 2018 10 10
b 2018 20 20
b 2018 30 30
b 2018 40 40
b 2019 50 50
b 2019 60 60
b 2019 70 70
b 2019 80 80
b 2019 90 90
b 2019 100 100
b 2019 110 110
b 2019 120 120
b 2019 130 130
b 2019 140 140
b 2019 150 150
b 2019 160 160
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 200 200
b 2020 210 210", header = TRUE)
result <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1 1
a 2018 1 1
b 2018 20 20
b 2018 30 30
b 2018 30 30
b 2018 20 20
b 2019 70 70
b 2019 170 170
b 2019 50 50
b 2019 150 150
b 2019 70 70
b 2019 150 150
b 2019 100 100
b 2019 120 120
b 2019 50 50
b 2019 160 160
b 2019 90 90
b 2019 150 150
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 190 190
b 2020 190 190", header = TRUE)
You can perform this kind of within-group resampling in base R using ave:
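Resample_function <- function(data) {
  # For each GROUP/YEAR combination, draw as many row indices as the group has rows.
  # Indexing with sample.int() avoids sample()'s special case for a length-one
  # vector, where sample(x) would draw from 1:x instead of returning x.
  idx <- ave(seq_len(nrow(data)), data$GROUP, data$YEAR,
             FUN = function(x) x[sample.int(length(x), replace = TRUE)])
  new_data <- data[idx, ]
  rownames(new_data) <- NULL
  new_data
}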
Resample_function(df)
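For anyone who wants to check the logic outside R, the same idea (group the rows, then draw with replacement as many rows as each group contains) can be sketched in plain Python. This is only an illustration under assumptions: resample_within_groups and the small rows sample are hypothetical names and data, not part of the answer above.

```python
import random
from collections import defaultdict


def resample_within_groups(rows, key, rng=random):
    """Bootstrap rows within each group; each group keeps its original size."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    resampled = []
    # Dicts preserve insertion order, so groups come out in first-seen order.
    for members in groups.values():
        # choices() samples with replacement; k equals the group size.
        resampled.extend(rng.choices(members, k=len(members)))
    return resampled


# A shortened stand-in for df: (GROUP, YEAR, VAR1)
rows = [
    ("a", 2018, 1.0), ("a", 2018, 2.0),
    ("b", 2018, 10), ("b", 2018, 20), ("b", 2018, 30),
    ("b", 2019, 50),
    ("b", 2020, 190), ("b", 2020, 200),
]

resampled = resample_within_groups(rows, key=lambda r: (r[0], r[1]))
for row in resampled:
    print(row)
```

Because the draw is done on the list of rows rather than on a vector of indices, a group with a single row (here b/2019) simply returns that row.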
I am importing data that is neither long nor wide:
clear
input str1 id purchased sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
My goal is to get the data in the following long format, reflecting the count in each year:
Identifier Year Inventory
A 2016 0
A 2017 1
A 2018 1
A 2019 1
B 2016 0
B 2017 0
B 2018 0
B 2019 0
C 2016 1
C 2017 1
C 2018 2
C 2019 1
D 2016 0
D 2017 0
D 2018 2
D 2019 1
My initial approach would be to first transform it into a wide format, with one row per identifier and a column for each year from 2016 to 2019, and then convert that into the desired long format. However, this seems inefficient.
Is there a shorter, more efficient way to do this, given that I have a much larger dataset?
This needs several small tricks. The most crucial are reshape long and fillin.
The inventory is essentially a running sum of purchases minus sales.
clear
input str1 Identifier Purchased Sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
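* Give each input row its own id so reshape long has a unique key
generate long id = _n
* Prefix both date variables with "year" (yearPurchased, yearSold)
rename (Purchased Sold) year=
reshape long year, i(id) j(Event) string
drop id
* Add every missing Identifier-year combination
fillin Identifier year
drop _fillin
drop if missing(year)
* Running sum: +1 for each purchase, -1 for each sale
bysort Identifier (year Event) : generate inventory = sum((Event == "Purchased") - (Event == "Sold"))
drop Event
* Keep the last observation per Identifier-year
bysort Identifier year : keep if _n == _N
list, sepby(Identifier)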
+----------------------------+
| Identi~r year invent~y |
|----------------------------|
1. | A 2016 0 |
2. | A 2017 1 |
3. | A 2018 1 |
4. | A 2019 1 |
|----------------------------|
5. | B 2016 0 |
6. | B 2017 0 |
7. | B 2018 0 |
8. | B 2019 0 |
|----------------------------|
9. | C 2016 1 |
10. | C 2017 1 |
11. | C 2018 2 |
12. | C 2019 1 |
|----------------------------|
13. | D 2016 0 |
14. | D 2017 0 |
15. | D 2018 2 |
16. | D 2019 1 |
+----------------------------+
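The "running sum of purchases minus sales" idea can be checked with a short plain-Python sketch. It is not a translation of the Stata code, just the same arithmetic under assumptions: inventory_by_year is a hypothetical helper, None stands in for Stata's missing value, and the 2016 to 2019 range is taken from the desired output.

```python
# (Identifier, Purchased, Sold); None marks a missing year.
records = [
    ("A", 2017, None),
    ("B", None, None),
    ("C", 2016, 2019),
    ("C", 2018, None),
    ("D", 2018, 2019),
    ("D", 2018, None),
]


def inventory_by_year(records, years):
    """Inventory in a year = purchases up to that year minus sales up to that year."""
    identifiers = sorted({ident for ident, _, _ in records})
    table = []
    for ident in identifiers:
        for year in years:
            bought = sum(1 for i, p, _ in records
                         if i == ident and p is not None and p <= year)
            sold = sum(1 for i, _, s in records
                       if i == ident and s is not None and s <= year)
            table.append((ident, year, bought - sold))
    return table


table = inventory_by_year(records, range(2016, 2020))
for ident, year, inventory in table:
    print(ident, year, inventory)
```

Iterating over every identifier and every year plays the role of fillin: identifiers with no events at all, such as B, still get a row for each year.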
I have a table that consists of names, points, and years. I need a query that returns all the names for a specific year, even if a name has no row for that year. Example:
Name Points Year
------- -------
tom 8 2011
jim 45 2011
jerry 25 2011
zack 124 2011
jeff 45 2011
tom 62 2012
jim 214 2012
jerry 13 2012
zack 32 2012
arnold 4 2012
Desired result for 2011:
Name Points Year
------- -------
tom 8 2011
jim 45 2011
jerry 25 2011
zack 124 2011
jeff 45 2011
arnold NULL NULL
I figured this would be easy but I am struggling to make it work.
From your explanation, I think you could use something like this:
SELECT DISTINCT
N.`Name`,
D.`Points`,
Y.`Year`
FROM
`MyData` Y
LEFT JOIN (SELECT DISTINCT `Name` FROM `MyData`) N ON 1=1
LEFT JOIN `MyData` D
ON D.`Year` = Y.`Year`
AND D.`Name` = N.`Name`
ORDER BY
Y.`Year`
While it's not pretty, it does seem to work as intended.
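If only one year is needed at a time, a somewhat simpler variant is to start from the distinct names and left join that year's rows onto them. The sketch below tries this with Python's built-in sqlite3 module, so it is SQLite rather than MySQL syntax, and it is a different query from the one above, offered only as an assumption-laden illustration. The table name MyData is reused from the answer; the year is passed as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyData (Name TEXT, Points INTEGER, Year INTEGER)")
conn.executemany(
    "INSERT INTO MyData VALUES (?, ?, ?)",
    [
        ("tom", 8, 2011), ("jim", 45, 2011), ("jerry", 25, 2011),
        ("zack", 124, 2011), ("jeff", 45, 2011),
        ("tom", 62, 2012), ("jim", 214, 2012), ("jerry", 13, 2012),
        ("zack", 32, 2012), ("arnold", 4, 2012),
    ],
)

# Every distinct name appears once; names without a row in the requested year
# get NULL for Points and Year because the year filter sits in the ON clause.
query = """
SELECT n.Name, d.Points, d.Year
FROM (SELECT DISTINCT Name FROM MyData) AS n
LEFT JOIN MyData AS d
       ON d.Name = n.Name AND d.Year = ?
ORDER BY d.Year IS NULL, n.Name
"""

rows_2011 = conn.execute(query, (2011,)).fetchall()
for row in rows_2011:
    print(row)
```

Putting the year condition in the ON clause instead of a WHERE clause is what keeps the unmatched names; a WHERE filter on d.Year would drop the NULL rows again.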
I have a table with the rows below:
Name Month Salary Expense
John Jan 1000 50
John Feb 5000 2000
Jack Jan 3000 100
I want to get the output in the format below. How can I achieve this?
Name JAN FEB
John 1000 50 5000 2000
Jack 3000 100 0 0
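This SQL Server query would work:
select name,
isnull(max(case when month='jan' then salary end), 0) as Salary_jan,
isnull(max(case when month='jan' then expense end), 0) as Expense_jan,
isnull(max(case when month='feb' then salary end), 0) as Salary_feb,
isnull(max(case when month='feb' then expense end), 0) as Expense_feb
-- and so on for the remaining months
from YourTable -- placeholder: the question does not give the table name
group by name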
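The same conditional-aggregation pattern can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions, not SQL Server code: the table name Pay is made up for the example, and COALESCE is used because SQLite has no isnull() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pay (Name TEXT, Month TEXT, Salary INTEGER, Expense INTEGER)")
conn.executemany(
    "INSERT INTO Pay VALUES (?, ?, ?, ?)",
    [("John", "Jan", 1000, 50), ("John", "Feb", 5000, 2000), ("Jack", "Jan", 3000, 100)],
)

# One CASE per month/measure pair; MAX collapses the group to the single
# matching value, and COALESCE turns "no row for that month" into 0.
query = """
SELECT Name,
       COALESCE(MAX(CASE WHEN Month = 'Jan' THEN Salary  END), 0) AS Salary_jan,
       COALESCE(MAX(CASE WHEN Month = 'Jan' THEN Expense END), 0) AS Expense_jan,
       COALESCE(MAX(CASE WHEN Month = 'Feb' THEN Salary  END), 0) AS Salary_feb,
       COALESCE(MAX(CASE WHEN Month = 'Feb' THEN Expense END), 0) AS Expense_feb
FROM Pay
GROUP BY Name
"""

pivot = {row[0]: row[1:] for row in conn.execute(query)}
print(pivot["John"])  # (1000, 50, 5000, 2000)
print(pivot["Jack"])  # (3000, 100, 0, 0)
```

Jack has no February row, so both February CASE expressions yield NULL for his group and COALESCE replaces them with 0, matching the desired output.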