Computing similarity between duplicated variables using unique identifier - r

I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates, for example rows 1 and 3, but not rows 1 and 6 or 3 and 6, due to the year difference. The variable dupfreq gives the number of rows with the same id and year, including the row itself.
  id   year tlabor   rev dupfreq
1 1419 2005      5  1072       2
2 1425 2005     42  2945       1
3 1419 2005      4   950       2
4 1443 2006     18  3900       1
5 1485 2006    118 35034       1
6 1419 2006      6  1851       1
I want to check row similarity (on tlabor and rev) for the rows with dupfreq > 1, grouped by id and year.
I was thinking of something similar to this:
  id   year  sim
1 1419 2005 0.83
Note that dupfreq can be greater than 2, but if the new table can only be generated from rows with dupfreq == 2, I am OK with that too.
Any advice is greatly appreciated! Thanks in advance!
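There is no single built-in "row similarity", so the metric below is an assumption: for each id-year group with dupfreq > 1, take the smaller-to-larger ratio of tlabor and of rev and average the two. On the sample rows this gives roughly 0.84, close to the 0.83 sketched above. A dplyr sketch:

```r
library(dplyr)

df <- data.frame(
  id      = c(1419, 1425, 1419, 1443, 1485, 1419),
  year    = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor  = c(5, 42, 4, 18, 118, 6),
  rev     = c(1072, 2945, 950, 3900, 35034, 1851),
  dupfreq = c(2, 1, 2, 1, 1, 1)
)

sim_tab <- df %>%
  filter(dupfreq > 1) %>%      # keep only duplicated id-year pairs
  group_by(id, year) %>%
  summarise(
    # assumed metric: average of smaller/larger ratios for tlabor and rev
    sim = mean(c(min(tlabor) / max(tlabor), min(rev) / max(rev))),
    .groups = "drop"
  )
sim_tab  # one row: id 1419, year 2005, sim ~ 0.84
```

For dupfreq > 2 this collapses each group to its min and max, which may or may not be what you want; any pairwise metric could be swapped in inside summarise().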

Related

Is there a way to order output by multiple columns within the aggregate() function in R?

I'd like to use the aggregate function but then have the output ordered (smallest to largest) based on 2 columns: first by one, and then by the other within it.
Here is an example:
test<-data.frame(c(sample(1:4),1),sample(2001:2005),11:15,c(letters[1:4],'a'),sample(101:105))
names(test)<-c("plot","year","age","spec","biomass")
test
plot year age spec biomass
1 2 2001 11 a 102
2 4 2005 12 b 101
3 1 2004 13 c 105
4 3 2002 14 d 103
5 1 2003 15 a 104
aggregate(biomass~plot+year,data=test,FUN='sum')
This creates output with just year ordered from smallest to largest.
plot year biomass
1 2 2001 102
2 3 2002 103
3 1 2003 104
4 1 2004 105
5 4 2005 101
But I'd like the output to be ordered by plot and THEN year.
plot year biomass
1 1 2003 104
2 1 2004 105
3 2 2001 102
4 3 2002 103
5 4 2005 101
Thanks!!
The aggregate function does sort by columns. Switch the order of the arguments to get your desired sorting:
# switch from
a0 <- aggregate(biomass~plot+year,data=test,FUN='sum')
# to
a <- aggregate(biomass~year+plot,data=test,FUN='sum')
The data is sorted in the way described in the question. No further sorting is needed.
If you want to change the order in which columns are displayed to exactly match your desired output, try a[,c(1,3,2)]. This reordering is not computationally costly. My understanding is that a data.frame is a list of pointers to column vectors; and this just reorders that list.
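Alternatively, if you want to keep the original formula order, you can sort the result afterwards with order(). A sketch using a deterministic version of the example data (fixed values in place of sample()):

```r
test <- data.frame(plot = c(2, 4, 1, 3, 1),
                   year = c(2001, 2005, 2004, 2002, 2003),
                   age = 11:15,
                   spec = c("a", "b", "c", "d", "a"),
                   biomass = c(102, 101, 105, 103, 104))

a0 <- aggregate(biomass ~ plot + year, data = test, FUN = sum)
# reorder rows: by plot first, then by year within plot
a0_sorted <- a0[order(a0$plot, a0$year), ]
a0_sorted
```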

Place row values in columns with the corresponding name

I need to change some of my datasets in the following way.
I have one panel dataset containing a unique firm id as identifier (id), the observation year (year, 2002-2012) and some firm variables with the value for the corresponding year (size, turnover, etc.). It looks somewhat like:
[ID] [year] [size] [turnover] ...
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025
...
I now need to transform it in the following way: I create a separate matrix for every characteristic of interest, where every firm (according to its id) has only one row and the corresponding values per year in separate columns.
It should be mentioned that not every firm is in the dataset in every year, since firms might be founded later, already closed down, etc., as illustrated in the example. In the end it should look something like the following (example for the size variable):
[ID] [2002] [2003] [2004] [2005]
1 14 15 17 18
2 - - 10 11
I tried it so far with the %in% command, but did not manage to get the values in the right columns.
DF <- read.table(text="[ID] [year] [size] [turnover]
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025",header=TRUE)
library(reshape2)
dcast(DF, X.ID.~X.year.,value.var="X.size.")
# X.ID. 2002 2003 2004 2005
# 1 1 14 15 17 18
# 2 2 NA NA 10 11
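For comparison, the same reshape with tidyr's pivot_wider; the clean column names here (ID, year, size) are an assumption, since read.table turns the bracketed headers into X.ID. and so on:

```r
library(tidyr)

DF <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                 year = c(2002, 2003, 2004, 2005, 2004, 2005),
                 size = c(14, 15, 17, 18, 10, 11),
                 turnover = c(1200, 1250, 1100, 1350, 5750, 6025))

# one row per firm, one column per year; missing firm-years become NA
wide <- pivot_wider(DF[, c("ID", "year", "size")],
                    names_from = year, values_from = size)
wide
```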

How can I find the changes in a single column in R?

I have a CSV file with 3 columns and 100+ rows. The variables in all columns change according to the data in column 1, "Time".
Time Temp Cloud
1100 22 1
1102 14 1
1104 14 2
1106 23 1
1108 12 1
1110 21 2
1112 17 2
1114 12 3
1116 24 3
I want to know when "Cloud" changes [e.g. at the 3rd and 6th rows], and I want to obtain the other variables in that row and in the row before it.
How can I do that?
Thanks
diff will almost do this directly. Apply it twice. Calling your example data d:
> d[c(diff(d$Cloud) != 0,FALSE) | c(FALSE, diff(d$Cloud) != 0),]
Time Temp Cloud
2 1102 14 1
3 1104 14 2
4 1106 23 1
5 1108 12 1
6 1110 21 2
7 1112 17 2
8 1114 12 3
I would do something like this:
df$Change <- c(0, sign(diff(df$Cloud)))
subset(df, Change != 0)[, -4]
This eliminates the rows with no change (the [, -4] drops the helper column again). Note that it returns only the rows where a change occurs, not the preceding rows.
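An equivalent sketch with which(), which makes the "row of the change plus the row before it" logic explicit (same example data, called d):

```r
d <- data.frame(Time  = seq(1100, 1116, by = 2),
                Temp  = c(22, 14, 14, 23, 12, 21, 17, 12, 24),
                Cloud = c(1, 1, 2, 1, 1, 2, 2, 3, 3))

idx  <- which(diff(d$Cloud) != 0)       # last row before each change
rows <- sort(unique(c(idx, idx + 1)))   # add the rows where the change occurs
d[rows, ]                               # rows 2 through 8 for this data
```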

Merging data frames in R

Let's say I have two data frames. Each has a DAY, a MONTH, and a YEAR column along with one other variable, C and P, respectively. I want to merge the two data frames in two different ways. First, I merge by date:
test<-merge(data1,data2,by.x=c("DAY","MONTH","YEAR"),by.y=c("DAY","MONTH","YEAR"),all.x=T,all.y=F)
This works perfectly. The second merge is the one I'm having trouble with. So far I have merged the value for January 5, 1996 from data1 and the value for January 5, 1996 from data2 into one data frame, but now I would like to merge a third value onto each row of the new data frame. Specifically, I want to merge the value for January 4, 1996 from data2 with the two values from January 5, 1996. Any tips on making merge flexible in this way?
sample data:
data1
C DAY MONTH YEAR
1 1 1 1996
6 5 1 1996
5 8 1 1996
3 11 1 1996
9 13 1 1996
2 14 1 1996
3 15 1 1996
4 17 1 1996
data2
P DAY MONTH YEAR
1 1 1 1996
4 2 1 1996
8 3 1 1996
2 4 1 1996
5 5 1 1996
2 6 1 1996
7 7 1 1996
4 8 1 1996
6 9 1 1996
1 10 1 1996
7 11 1 1996
3 12 1 1996
2 13 1 1996
2 14 1 1996
5 15 1 1996
9 16 1 1996
1 17 1 1996
Make a new column that is a Date type, not just some day, month, year integers. You can use as.Date() to do this, though you will need to look up the right format string for the format= argument given your input. Let's call that column D1 in both data frames. Now do data2$D2 <- data2$D1 + 1. The key point here is that Date types allow simple date arithmetic. Merging data1's D1 against data2's D2 then attaches the previous day's value from data2 to each row.
In case that was confusing, the bottom line is that you need to convert your columns to Date types so that you can do date arithmetic.
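A runnable sketch of both merges on small sample frames (the column name P_prev is illustrative, not from the original data):

```r
data1 <- data.frame(C = c(1, 6, 5),
                    DAY = c(1, 5, 8), MONTH = 1, YEAR = 1996)
data2 <- data.frame(P = c(1, 2, 5, 4),
                    DAY = c(1, 4, 5, 8), MONTH = 1, YEAR = 1996)

# First merge: same-day values from both frames
test <- merge(data1, data2, by = c("DAY", "MONTH", "YEAR"), all.x = TRUE)

# Real Date columns make date arithmetic possible
test$D1  <- as.Date(paste(test$YEAR, test$MONTH, test$DAY, sep = "-"))
data2$D2 <- as.Date(paste(data2$YEAR, data2$MONTH, data2$DAY, sep = "-")) + 1

# Second merge: a data2 row with D2 == D1 carries the previous day's P
prev <- data.frame(D2 = data2$D2, P_prev = data2$P)
out  <- merge(test, prev, by.x = "D1", by.y = "D2", all.x = TRUE)
out  # the DAY 5 row picks up P_prev = 2, the value from January 4
```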

Function for certain values in rows

I have panel data which looks like this (only the part relevant to my question):
Persno 122 122 122 333 333 333 333 333 444 444
Income 1500 1500 2000 2000 2100 2500 2500 1500 2000 2200
year 1 2 3 1 2 3 4 5 1 2
I need a command or function that recognizes the different persons. For all rows with the same person, I would like to output the average income.
Thank you very much.
My favorite tool to solve problems like this is ddply, in the plyr package.
library(plyr)
p <- data.frame(year = rep(c(1, 2, 3), 3),
                persno = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                income = c(1500, 1500, 2000, 2000, 2100, 2500, 2500, 1500, 2000))
ddply(p, .(persno), summarize, mean.income = mean(income))
which gives us the output
persno mean.income
1 1 1666.667
2 2 2200.000
3 3 2000.000
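A base-R alternative, in case you would rather avoid extra packages, using the data from the question:

```r
p <- data.frame(persno = c(122, 122, 122, 333, 333, 333, 333, 333, 444, 444),
                income = c(1500, 1500, 2000, 2000, 2100, 2500, 2500, 1500, 2000, 2200),
                year   = c(1, 2, 3, 1, 2, 3, 4, 5, 1, 2))

# one mean income per person
means <- aggregate(income ~ persno, data = p, FUN = mean)

# or attach each person's mean to every row of the panel
p$mean.income <- ave(p$income, p$persno, FUN = mean)
```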
