I need to change some of my datasets in the following way.
I have one panel dataset containing a unique firm id as identifier (id), the observation year (year, 2002-2012) and some firm variables with the value for the corresponding year (size, turnover, etc.). It looks somewhat like:
[ID] [year] [size] [turnover] ...
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025
...
I now need to transform it as follows.
I want to create a separate matrix for every characteristic of interest, where
every firm (according to its id) has only one row and the
corresponding values per year sit in separate columns.
It should be mentioned that not every firm is in the dataset in every year, since some are founded later, already closed down, etc., as illustrated in the example. In the end it should look somewhat like the following (example for the size variable):
[ID] [2002] [2003] [2004] [2005]
1 14 15 17 18
2 - - 10 11
I tried it so far with the %in% operator, but did not manage to get the values into the right columns.
DF <- read.table(text="[ID] [year] [size] [turnover]
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025",header=TRUE)
library(reshape2)
# cast to wide format: one row per firm, one column per year
dcast(DF, X.ID. ~ X.year., value.var = "X.size.")
# X.ID. 2002 2003 2004 2005
# 1 1 14 15 17 18
# 2 2 NA NA 10 11
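If you prefer to avoid extra packages, base R's reshape() produces the same wide layout. A sketch, assuming plain column names rather than the X.ID.-style names produced by reading the bracketed headers above:

```r
# same panel data with plain column names
DF <- data.frame(ID       = c(1, 1, 1, 1, 2, 2),
                 year     = c(2002, 2003, 2004, 2005, 2004, 2005),
                 size     = c(14, 15, 17, 18, 10, 11),
                 turnover = c(1200, 1250, 1100, 1350, 5750, 6025))

# one row per firm, one size.<year> column per year; missing years become NA
wide <- reshape(DF[, c("ID", "year", "size")],
                idvar = "ID", timevar = "year", direction = "wide")
```

The missing firm-years (firm 2 before 2004) come out as NA, matching the dcast result.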
I have two data frames, df1 and df2. Both have four columns: three with the same names (ID, Year and Week) and one that differs between them.
>df1
ID Year Oxygen Week
---- ------ ------- -------
1 2004 18 1
1 2005 17 1
2 2006 17 1
2 2007 18 1
3 2008 19 1
3 2010 20 1
3 2010 20 1
4 2012 16 1
5 2013 18 1
6 2014 18 1
>df2
ID Year Kg Week
---- ------ ----- -------
1 2004 20 1
1 2005 35 2
2 2006 30 2
3 2007 15 1
3 2008 70 2
4 2009 40 1
5 2013 55 1
6 2012 40 1
6 2014 10 2
7 2013 15 1
I would like to produce a new data frame which contains the rows from df1 only when the combination of ID and Year in df1 is also present in df2. The Week might or might not be the same for that row; I don't want to take the Week column into account. So the first row in df1 has ID 1 and Year 2004, which also occurs in df2. The combination of ID and Year for the second row in df1 also occurs in df2, but with a different value for Week.
I know how to do it if it only depends on one column:
df3 <- subset(df1, ID %in% df2$ID)
There was a solution for this when I didn't have the column Week which was:
df3 <- df1 %>% inner_join(df2)
But I don't know how to make it depend on both ID and Year at the same time without also taking Week into account.
I should end up with the following data frame, which only contain the columns from df1:
>df3
ID Year Oxygen Week
---- ------ ------- -------
1 2004 18 1
1 2005 17 1
2 2006 17 1
3 2008 19 1
4 2012 16 1
5 2013 18 1
6 2014 18 1
tidyverse approach: use a filtering join. semi_join() keeps the rows of df1 that have a match in df2 on exactly the key columns you name, returns only df1's columns, and ignores Week:
library(dplyr)
df3 <-
  df1 %>%
  semi_join(df2, by = c("ID", "Year"))
Note that a plain inner_join(df2) would join on all shared columns, including Week, and would also append df2's columns.
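A base R equivalent of the filtering join, using the %in% idea from the question on a pasted ID/Year key (the sample data is typed in here to keep the sketch self-contained):

```r
df1 <- data.frame(ID     = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6),
                  Year   = c(2004, 2005, 2006, 2007, 2008, 2010, 2010, 2012, 2013, 2014),
                  Oxygen = c(18, 17, 17, 18, 19, 20, 20, 16, 18, 18),
                  Week   = 1)
df2 <- data.frame(ID   = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 7),
                  Year = c(2004, 2005, 2006, 2007, 2008, 2009, 2013, 2012, 2014, 2013),
                  Kg   = c(20, 35, 30, 15, 70, 40, 55, 40, 10, 15),
                  Week = c(1, 2, 2, 1, 2, 1, 1, 1, 2, 1))

# keep df1 rows whose (ID, Year) combination appears anywhere in df2;
# Week plays no role in the match
df3 <- df1[paste(df1$ID, df1$Year) %in% paste(df2$ID, df2$Year), ]
```

The pasted key simply treats each (ID, Year) pair as one string so that %in% can match on both columns at once.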
I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates, for example, rows 1 and 3 (same id and year), but not rows 1 and 6 or 3 and 6, because the year differs. The variable dupfreq gives the number of rows in the dataset with the same id and year, including that row.
    id   year  tlabor    rev  dupfreq
1 1419   2005       5   1072        2
2 1425   2005      42   2945        1
3 1419   2005       4    950        2
4 1443   2006      18   3900        1
5 1485   2006     118  35034        1
6 1419   2006       6   1851        1
I want to compute row similarity (on tlabor and rev) for the rows with dupfreq > 1, grouped by id and year.
I was thinking of something similar to this:
    id   year   sim
1 1419   2005  0.83
Note that dupfreq can be > 2, but if I can only generate the new table using rows with dupfreq == 2, I am OK with that too.
Any advice is greatly appreciated! Thanks in advance!
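The question does not pin down a similarity formula, so here is one assumed metric as a starting point: for each duplicated (id, year) pair, take the min/max ratio of tlabor and of rev and average the two. The 0.83 in the example suggests something in this spirit, but the exact formula is a guess:

```r
df <- data.frame(
  id      = c(1419, 1425, 1419, 1443, 1485, 1419),
  year    = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor  = c(5, 42, 4, 18, 118, 6),
  rev     = c(1072, 2945, 950, 3900, 35034, 1851),
  dupfreq = c(2, 1, 2, 1, 1, 1)
)

# restrict to the duplicated pairs, then score each (id, year) group
dups <- df[df$dupfreq == 2, ]
sim <- do.call(rbind, lapply(split(dups, paste(dups$id, dups$year)), function(g) {
  data.frame(id   = g$id[1],
             year = g$year[1],
             sim  = mean(c(min(g$tlabor) / max(g$tlabor),
                           min(g$rev)    / max(g$rev))))
}))
```

For the 1419/2005 pair this gives the average of 4/5 and 950/1072, about 0.84; groups with dupfreq > 2 would need a pairwise extension (e.g. the average over all row pairs).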
I have a large DF with about 250K obs. and 19 variables on people's graduation data. People appear in the DF multiple times. I need to make two new DFs where every person (ID) appears only once per DF. My true DF has more columns, but for this example I took the 3 that are needed and 2 as examples of the other background information in my DF.
ID - a persons unique ID
Level - Shows the level of education. Higher the number the better: Example 8 = Phd, 7 = Masters
Year - The year of the graduation
Field - background information on the field of study. We need to keep this data.
Gender - Gender of the person. We need to keep this data.
Original DF named "Graduations"
ID Level Year Field Gender
1 4 2016 31 M
1 5 2016 43 M
2 6 2010 12 F
2 7 2012 12 F
2 8 2017 19 F
3 5 2011 12 F
3 5 2009 31 F
4 6 2018 43 M
4 6 2018 44 M
5 5 2015 19 M
5 6 2011 32 M
DF 1 is named "Highest", meaning the highest level of graduation. In this DF we only keep rows where the person has reached their highest level of education. If a person has reached it more than once, we keep the newest of those graduations, and if that is also a tie we choose one randomly.
"Highest"
ID Level Year Field Gender
1 5 2016 43 M (Row kept because it was the highest level)
2 8 2017 19 F (Row kept because it was the highest level)
3 5 2011 12 F (Row kept because it was the highest level and newest)
4 6 2018 43 M (Row chosen randomly)
5 6 2011 32 M (Row kept because it was the highest level)
DF 2 is named "Last", meaning the last obtained level of graduation. In this DF we only keep the row with the newest data. If a person has more than one graduation in that year, we choose the highest level; if that is also a tie, we choose at random (it does not need to be the same random row as in DF 1).
"Last"
ID Level Year Field Gender
1 5 2016 43 M (Row kept because it was the newest and highest level)
2 8 2017 19 F (Row kept because it was the newest)
3 5 2011 12 F (Row kept because it was the newest)
4 6 2018 44 M (Row chosen randomly)
5 5 2015 19 M (Row kept because it was the newest)
I tried searching for this on Stack Overflow but did not find what I was looking for; I might have used the wrong keywords.
I prefer base R functions, but if needed the most common packages are also OK. This DF is on a secure computer, and installing uncommon packages takes forever since I must make a request for each one. The most common ones are already installed.
Thank you for your help.
A data.table solution
It's not quite elegant, but it works.
library(data.table)
dt <- fread('ID Level Year Field Gender
1 4 2016 31 M
1 5 2016 43 M
2 6 2010 12 F
2 7 2012 12 F
2 8 2017 19 F
3 5 2011 12 F
3 5 2009 31 F
4 6 2018 43 M
4 6 2018 44 M
5 5 2015 19 M
5 6 2011 32 M')
# "Highest": among each ID's rows at the maximum Level, take the newest one
# (remaining ties broken by original row order)
df1 <- dt[, .SD[Level == max(Level)][order(-Year)][1], by = .(ID)]
# "Last": flag rows in the newest Year per ID that are at the highest Level
# within that (ID, Year), then keep the first flagged row per ID
df2 <- dt[, highest_lvl_year := max(Level), by = .(ID, Year)][,
  flag := fifelse(Year == max(Year) & Level == highest_lvl_year, "Y", "N"), by = .(ID)][,
  head(.SD[flag == "Y"], 1), by = .(ID)]
df1
#> ID Level Year Field Gender
#> 1: 1 5 2016 43 M
#> 2: 2 8 2017 19 F
#> 3: 3 5 2011 12 F
#> 4: 4 6 2018 43 M
#> 5: 5 6 2011 32 M
df2[,1:5]
#> ID Level Year Field Gender
#> 1: 1 5 2016 43 M
#> 2: 2 8 2017 19 F
#> 3: 3 5 2011 12 F
#> 4: 4 6 2018 43 M
#> 5: 5 5 2015 19 M
Created on 2020-04-20 by the reprex package (v0.3.0)
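Since base R is preferred, the same two tables can also be built without data.table, using only order() and duplicated(). A sketch; note that ties here fall back to the original row order rather than a random pick:

```r
Graduations <- data.frame(
  ID     = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  Level  = c(4, 5, 6, 7, 8, 5, 5, 6, 6, 5, 6),
  Year   = c(2016, 2016, 2010, 2012, 2017, 2011, 2009, 2018, 2018, 2015, 2011),
  Field  = c(31, 43, 12, 12, 19, 12, 31, 43, 44, 19, 32),
  Gender = c("M", "M", "F", "F", "F", "F", "F", "M", "M", "M", "M")
)

# "Highest": sort so the best row per ID comes first (highest Level,
# then newest Year), then keep the first row of each ID
ord <- order(Graduations$ID, -Graduations$Level, -Graduations$Year)
Highest <- Graduations[ord, ]
Highest <- Highest[!duplicated(Highest$ID), ]

# "Last": same idea with the sort keys swapped (newest Year, then highest Level)
ord <- order(Graduations$ID, -Graduations$Year, -Graduations$Level)
Last <- Graduations[ord, ]
Last <- Last[!duplicated(Last$ID), ]
```

If a truly random tie-break is required, you could shuffle the rows first (e.g. Graduations[sample(nrow(Graduations)), ]) before sorting.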
I'd like to use the aggregate function but then have the output be ordered (smallest to largest) based on 2 columns (first one, and then subset by the other).
Here is an example:
test<-data.frame(c(sample(1:4),1),sample(2001:2005),11:15,c(letters[1:4],'a'),sample(101:105))
names(test)<-c("plot","year","age","spec","biomass")
test
plot year age spec biomass
1 2 2001 11 a 102
2 4 2005 12 b 101
3 1 2004 13 c 105
4 3 2002 14 d 103
5 1 2003 15 a 104
aggregate(biomass~plot+year,data=test,FUN='sum')
This creates output with just year ordered from smallest to largest.
plot year biomass
1 2 2001 102
2 3 2002 103
3 1 2003 104
4 1 2004 105
5 4 2005 101
But I'd like the output to be ordered by plot and THEN year.
plot year biomass
1 1 2003 104
2 1 2004 105
3 2 2001 102
4 3 2002 103
5 4 2005 101
Thanks!!
The aggregate function does sort by its grouping columns, with the first variable on the right-hand side of the formula varying fastest. Switch the order of the grouping terms to get your desired sorting:
# switch from
a0 <- aggregate(biomass ~ plot + year, data = test, FUN = sum)
# to
a <- aggregate(biomass ~ year + plot, data = test, FUN = sum)
The rows are now sorted by plot and then year, as described in the question; no further sorting is needed.
If you also want the columns displayed in the order plot, year, biomass, use a[, c(2, 1, 3)]. This reordering is not computationally costly: a data.frame is essentially a list of pointers to column vectors, and this just reorders that list.
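If you would rather keep the original formula, you can also sort the aggregated result afterwards with order(), which makes the intended sort keys explicit (the data is hard-coded here to match the example printout, since the question builds it with sample()):

```r
test <- data.frame(
  plot    = c(2, 4, 1, 3, 1),
  year    = c(2001, 2005, 2004, 2002, 2003),
  age     = 11:15,
  spec    = c("a", "b", "c", "d", "a"),
  biomass = c(102, 101, 105, 103, 104)
)

# aggregate as in the question, then reorder rows by plot and then year
a0 <- aggregate(biomass ~ plot + year, data = test, FUN = sum)
a  <- a0[order(a0$plot, a0$year), ]
```

This leaves the column order plot, year, biomass untouched, so no column reshuffling is needed.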
Let's say I have two data frames. Each has a DAY, a MONTH, and a YEAR column, along with one other variable, C and P, respectively. I want to merge the two data frames in two different ways. First, I merge by date:
test<-merge(data1,data2,by.x=c("DAY","MONTH","YEAR"),by.y=c("DAY","MONTH","YEAR"),all.x=T,all.y=F)
This works perfectly. The second merge is the one I'm having trouble with. So, I currently I have merged the value for January 5, 1996 from data1 and the value for January 5, 1996 from data2 into one data frame, but now I would like to merge a third value onto each row of the new data frame. Specifically, I want to merge the value for Jan 4, 1996 from data2 with the two values from January 5, 1996. Any tips on getting merge to be flexible in this way?
sample data:
data1
C DAY MONTH YEAR
1 1 1 1996
6 5 1 1996
5 8 1 1996
3 11 1 1996
9 13 1 1996
2 14 1 1996
3 15 1 1996
4 17 1 1996
data2
P DAY MONTH YEAR
1 1 1 1996
4 2 1 1996
8 3 1 1996
2 4 1 1996
5 5 1 1996
2 6 1 1996
7 7 1 1996
4 8 1 1996
6 9 1 1996
1 10 1 1996
7 11 1 1996
3 12 1 1996
2 13 1 1996
2 14 1 1996
5 15 1 1996
9 16 1 1996
1 17 1 1996
Make a new column that is a Date type, not just separate day, month, year integers. You can use as.Date() to do this, though you will need to look up the right format= argument for your strings. Let's call that column D1. Now do data2$D2 <- data2$D1 + 1. The key point here is that Date types allow simple date arithmetic. Now just merge by x = D1 and y = D2.
In case that was confusing, the bottom line is that you need to convert your columns to Date types so that you can do date arithmetic.
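A sketch of this on the sample data above; the helper names D, prev and P_prev are illustrative, not from the question:

```r
data1 <- data.frame(C = c(1, 6, 5, 3, 9, 2, 3, 4),
                    DAY = c(1, 5, 8, 11, 13, 14, 15, 17),
                    MONTH = 1, YEAR = 1996)
data2 <- data.frame(P = c(1, 4, 8, 2, 5, 2, 7, 4, 6, 1, 7, 3, 2, 2, 5, 9, 1),
                    DAY = 1:17, MONTH = 1, YEAR = 1996)

# build a real Date column from the integer parts
data1$D <- as.Date(sprintf("%d-%02d-%02d", data1$YEAR, data1$MONTH, data1$DAY))
data2$D <- as.Date(sprintf("%d-%02d-%02d", data2$YEAR, data2$MONTH, data2$DAY))

# first merge: same-day P values
test <- merge(data1, data2[, c("D", "P")], by = "D", all.x = TRUE)

# second merge: previous-day P values; shifting data2's dates forward one day
# lines Jan 4's value up with Jan 5's row
prev <- data.frame(D = data2$D + 1, P_prev = data2$P)
test <- merge(test, prev, by = "D", all.x = TRUE)
```

After this, the January 5 row carries C from data1, the same-day P (from Jan 5) and the previous-day P_prev (from Jan 4).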