I'll apologise in advance: I know this has likely been answered elsewhere, but I can't find the answer I need, and I haven't managed to adapt other code I've found to my needs.
I have a data frame:
FILE | TECHNIQUE | COUNT
------------------------
A | ONE | 10
A | TWO | 25
B | ONE | 5
B | TWO | 30
C | ONE | 30
C | TWO | 50
I would like to produce a data frame with a row for each FILE containing the difference between the TWO and ONE COUNT values, i.e.
FILE | DIFFERENCE
-----------------
A | 15
B | 25
C | 20
I'm convinced I should be able to do this fairly easily with base R or plyr, but I'm a bit stuck. Could anyone suggest a good way to do this, and perhaps good tutorials on plyr that might help me with similar problems in the future?
Thanks
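For reference, the answers below assume the example data frame is built like this (a minimal sketch; the name DF is an assumption):

DF <- data.frame(FILE = c("A", "A", "B", "B", "C", "C"),
                 TECHNIQUE = c("ONE", "TWO", "ONE", "TWO", "ONE", "TWO"),
                 COUNT = c(10, 25, 5, 30, 30, 50))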
Using aggregate in base R (DF[, -2] drops the TECHNIQUE column, and diff gives TWO minus ONE within each FILE):
> aggregate(. ~ FILE, data = DF[, -2], FUN = diff)
FILE COUNT
1 A 15
2 B 25
3 C 20
Using ddply in plyr
> ddply(DF[,-2], .(FILE), summarize, DIFFERENCE=diff(COUNT))
FILE DIFFERENCE
1 A 15
2 B 25
3 C 20
with data.table
> # library(data.table)
> DT <- data.table(DF)
> DT[, diff(COUNT), by=FILE]
FILE V1
1: A 15
2: B 25
3: C 20
with by
> with(DF, by(COUNT, FILE, diff))
FILE: A
[1] 15
-----------------------------------------------------------------------------
FILE: B
[1] 25
-----------------------------------------------------------------------------
FILE: C
[1] 20
with tapply
> tapply(DF$COUNT, DF$FILE, diff)
A B C
15 25 20
with summaryBy from the doBy package
> # library(doBy)
> summaryBy(COUNT~FILE, FUN=diff, data=DF)
FILE COUNT.diff
1 A 15
2 B 25
3 C 20
Update
As a percentage (ONE's count as a percentage of TWO's):
> aggregate(. ~ FILE, data = DF[, -2], FUN = function(x) (x[1] / x[2]) * 100)
FILE COUNT
1 A 40.00000
2 B 16.66667
3 C 60.00000
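For completeness, the same per-FILE difference with dplyr, plyr's successor (a sketch, assuming the DF above, with exactly one ONE row and one TWO row per FILE, in that order):

library(dplyr)
DF %>%
  group_by(FILE) %>%
  summarise(DIFFERENCE = diff(COUNT))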
Related
I'll start by saying that the question on filling in missing data in one data frame with info from another has one solution that may work for my problem. However, it solves it with a for loop, and I would prefer a vectorized solution.
I have 125 years of climate data with year, month, temperature, precipitation, and open pan evaporation; it is daily data summarized by month. Some years in the late 1800s have entire months missing, and I would like to substitute each missing month with its equivalent month from a 30-year average around that time.
I have pasted some of the code I've been playing with, below:
# For simplicity, let's pretend there are 5 months in the year, so year 3
# is the only year with a complete set of data, years 1 and 2 are missing some.
df1<-structure(
list(
Year=c(1,1,1,2,2,3,3,3,3,3),
Month=c(1,2,4,2,5,1,2,3,4,5),
Temp=c(-2,2,10,-4,12,2,4,8,14,16),
Precip=c(20,10,50,10,60,26,18,40,60,46),
Evap=c(2,6,30,4,48,4,10,32,70,40)
)
)
# This represents the 30-year average data:
df2<-structure(
list(
Month=c(1,2,3,4,5),
Temp=c(1,3,9,13,15),
Precip=c(11,13,21,43,35),
Evap=c(1,5,13,35,45)
)
)
# to match my actual setup (tibbles; dplyr provides as_tibble and a
# row-wise setdiff for the comparison below)
library(dplyr)
df1 <- as_tibble(df1)
df2 <- as_tibble(df2)
# I can get to the list of months missing from a given year
full_year <- df2[,1]
compare_year1 <- df1[df1$Year==1,2]
missing_months <- setdiff(full_year,compare_year1)
# Or I can get the full data from each year missing one or more months
year_full <- df2[,1]
years_compare <- split(df1[,c(2)], df1$Year)
years_missing_months <- names(years_compare[sapply(years_compare,nrow)<5])
complete_years_missing_months <- df1[df1$Year %in% years_missing_months,]
This is where I've gotten stumped.
I've looked at anti_join and merge, but it looks like they need data of the same length in each frame. I can go from lists grouped by year to identifying the years that are missing months, but I'm not sure how to actually get the rows inserted from there. It seems like lapply could be useful, but the answer ain't comin'.
Thanks in advance.
Edit 7/19: As an illustration of what I need, just looking at year "1", the current data (df1) has the following:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 4 | 10 | 50 | 30
Months 3 and 5 are missing data, so I would like to insert the equivalent-month data from the 30-year average table (df2), so the final result for year "1" would look like:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 3 | 9 | 21 | 13
1 | 4 | 10 | 50 | 30
1 | 5 | 15 | 35 | 45
Then fill in every year missing months in like manner. Year "3" would have no change, because (in this 5-month example) there are no months missing data.
First just add rows to hold the imputed values, since you know that there are missing rows with known dates:
# build a fake date for each row ("200" + Year gives years 2001-2003)
df1$date <- as.Date(paste0("200", df1$Year, "/", df1$Month, "/01"))
# the full monthly sequence over that range, restricted to months 1-5
pretend_12months <- seq(min(df1$date), max(df1$date), by = "1 month")
pretend_5months <- pretend_12months[lubridate::month(pretend_12months) < 6]
pretend_5months <- data.frame(date = pretend_5months)
# a full outer merge adds an all-NA row for every missing month
new <- merge(df1, pretend_5months, by = "date", all = TRUE)
# recover Year (last digit of the fake year) and Month for the new rows
new$Year <- ifelse(is.na(new$Year),
                   substr(lubridate::year(new$date), 4, 4),
                   new$Year)
new$Month <- ifelse(is.na(new$Month),
                    lubridate::month(new$date),
                    new$Month)
Impute the NA values using a left join:
# key part: a left join using any library or builtin method (left_join, merge, etc.)
fillin <- sqldf::sqldf("select a.date, a.Year, a.Month, b.Temp, b.Precip, b.Evap
                        from new a left join df2 b on a.Month = b.Month")
# fillin has the same row count and order as new, so the same logical
# mask lines up row-for-row when copying values into the NA slots
new$Temp[is.na(new$Temp)] <- fillin$Temp[is.na(new$Temp)]
new$Precip[is.na(new$Precip)] <- fillin$Precip[is.na(new$Precip)]
new$Evap[is.na(new$Evap)] <- fillin$Evap[is.na(new$Evap)]
date Year Month Temp Precip Evap
1 2001-01-01 1 1 -2 20 2
2 2001-02-01 1 2 2 10 6
3 2001-03-01 1 3 9 21 13
4 2001-04-01 1 4 10 50 30
5 2001-05-01 1 5 15 35 45
6 2002-01-01 2 1 1 11 1
7 2002-02-01 2 2 -4 10 4
8 2002-03-01 2 3 9 21 13
9 2002-04-01 2 4 13 43 35
10 2002-05-01 2 5 12 60 48
11 2003-01-01 3 1 2 26 4
12 2003-02-01 3 2 4 18 10
13 2003-03-01 3 3 8 40 32
14 2003-04-01 3 4 14 60 70
15 2003-05-01 3 5 16 46 40
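A fully vectorized alternative (a sketch using dplyr and tidyr; it assumes the df1 and df2 built in the question, with df2 covering every month): complete() adds the missing Year/Month rows as NAs, and coalesce() fills those NAs from the 30-year averages.

library(dplyr)
library(tidyr)

filled <- df1 %>%
  complete(Year, Month = df2$Month) %>%                      # add NA rows for missing months
  left_join(df2, by = "Month", suffix = c("", ".avg")) %>%   # attach the 30-year averages
  mutate(Temp   = coalesce(Temp,   Temp.avg),
         Precip = coalesce(Precip, Precip.avg),
         Evap   = coalesce(Evap,   Evap.avg)) %>%
  select(Year, Month, Temp, Precip, Evap)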
I have two data sets with the same dimensions, and I want to combine them such that the 1st column of the second data set is stacked next to the 1st column of the first data set, and so on.
Consider the example below, which is the expected output. Here, v1 comes from data set 1, and v2 comes from data set 2. I also want to keep the column headers as they are.
| v1 | v2 |
|:------:|:------:|
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
I tried cbind() and data.frame(), but both appended the full second data set after the first, rather than interleaving column by column.
> dim(firstDataSet)
[1] 100 200
> dim(secondDataSet)
[1] 100 200
> finalDataSet_cbind <- cbind(firstDataSet, secondDataSet)
> dim(finalDataSet_cbind)
[1] 100 400
> finalDataSet_dframe <- data.frame(firstDataSet, secondDataSet)
> dim(finalDataSet_dframe)
[1] 100 400
Please suggest correct and better ways to achieve this, thanks.
UPDATE: Response to the possible duplicate flag on this question:
That answer didn't work out for me. Following that solution gave me data that isn't what I want; it was similar to the final output of the cbind() approach explained above.
The first answer given here works for me, except for one small issue: a new column name is assigned to each column instead of keeping the original column headers.
Also, I don't have enough reputation to comment on the accepted answer.
Probably not the most efficient solution, since it uses a for loop, but it works:
# two example matrices with identical dimensions
data1 <- cbind(1:10, 11:20, 21:30)
data2 <- cbind(1:10, 11:20, 21:30)
# grow the result one pair of columns at a time
combined <- NULL
for (i in 1:ncol(data1)) {
  combined <- cbind(combined, data1[, i], data2[, i])
}
To fix the column-name requirement, you could do this. Basically, you first cbind, then build an index in the interleaved order. Using that index, you also build a vector of the correct column names. You then reorder the columns by the index and reapply the column names.
df1 <- df2 <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
final <- cbind(df1, df2)
# 1,1,2,2,3,3 plus 0,3,0,3,0,3 gives the interleaved order 1,4,2,5,3,6
indexed <- rep(1:ncol(df1), each = 2) + (0:1) * ncol(df1)
new_colnames <- colnames(final)[indexed]
final_ordered <- final[indexed]
colnames(final_ordered) <- new_colnames
v1 v1 v2 v2 v3 v3
1 1 1 11 11 21 21
2 2 2 12 12 22 22
3 3 3 13 13 23 23
4 4 4 14 14 24 24
5 5 5 15 15 25 25
6 6 6 16 16 26 26
7 7 7 17 17 27 27
8 8 8 18 18 28 28
9 9 9 19 19 29 29
10 10 10 20 20 30 30
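The same interleaving can also be written without a loop (a sketch, assuming the df1 and df2 above): order() on a doubled column index yields the interleaved positions 1, 4, 2, 5, 3, 6, and the original column names are kept automatically.

interleaved <- cbind(df1, df2)[order(rep(seq_len(ncol(df1)), times = 2))]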
Dataset looks like this:
ID | DELAY | PERIOD | TYPE
A 5 30 days 1
A 1 60 days 1
A 2 30 days 2
A 1 30 days 2
B 2 30 days 2
C 10 30 days 2
Output should look like:
ID | AV.DELAY_PERIOD_30DAYS_TYPE_1 | AV.DELAY_PERIOD_30DAYS_TYPE_2 ...
A 5 1.5
So, essentially, I need to build a function that creates new variables: the average delay, one value per ID, split by each unique period and type.
Anyone?
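For reproducibility, the example data can be built like this (a sketch; the name df and the string coding of PERIOD are assumptions):

df <- data.frame(
  ID = c("A", "A", "A", "A", "B", "C"),
  DELAY = c(5, 1, 2, 1, 2, 10),
  PERIOD = c("30 days", "60 days", "30 days", "30 days", "30 days", "30 days"),
  TYPE = c(1, 1, 2, 2, 2, 2)
)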
You can use the aggregate function in base R. The formula interface keeps the non-numeric columns out of the aggregation (note that column names are case-sensitive); this gives the averages in long format:
avg <- aggregate(DELAY ~ ID + PERIOD + TYPE, data = df, FUN = mean)
We can use dcast from data.table to do the summarisation as well as the reshaping in one step:
library(data.table)
dcast(setDT(df), ID ~ paste0("AV.DELAY_PERIOD_", PERIOD) + TYPE, value.var = "DELAY", fun.aggregate = mean)
# ID AV.DELAY_PERIOD_30 days_1 AV.DELAY_PERIOD_30 days_2
#1: A 5 1.5
#2: B NaN 2.0
#3: C NaN 10.0
# AV.DELAY_PERIOD_60 days_1
#1: 1
#2: NaN
#3: NaN
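The same result with dplyr and tidyr (a sketch, assuming the df built above): summarise the means per group, then pivot to wide format.

library(dplyr)
library(tidyr)

df %>%
  group_by(ID, PERIOD, TYPE) %>%
  summarise(AV.DELAY = mean(DELAY), .groups = "drop") %>%
  pivot_wider(names_from = c(PERIOD, TYPE), values_from = AV.DELAY,
              names_prefix = "AV.DELAY_PERIOD_")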
I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
eth ZIP1 ZIP2 A B C
A 1 1 56 49 69
B 1 2 98 10 2
B 2 3 43 69 59
A 3 4 4 30 8
C 5 5 90 10 17
6 19 4 84
7 59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is taken from Inc at the intersection of the observation's eth (column) and ZIP (row). In my example, it would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or a fairly brute-force approach could solve it, but that takes time on my dataset, so I'm looking for a subtler way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed...) is maybe why I haven't found any similar question when searching the forum.
Thanks!
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[ melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc"),
Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
My solution (which may seem awkward): look up the row by ZIP and the column by eth.
for (i in seq_along(df$eth)) {
  df$Inc[i] <- Inc[match(df$ZIP1[i], Inc$ZIP2), as.character(df$eth[i])]
}
I have the following data frame in R that has overlapping data in the two columns a_sno and b_sno
a_sno <- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno <- c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df <- data.frame(a_sno, b_sno)
If you take a close look at the data you can see that 4, 5, 6 and 7 intersect/overlap, and I need to put them into a group called 1.
Likewise, 9, 10, 13 and 14 go into group 2, 11 and 15 into group 3, and so on; 200 does not intersect with any other row but still needs to be assigned its own group.
The resulting output should look like this:
---------
group|sno
---------
1 | 4
1 | 5
1 | 6
1 | 7
2 | 9
2 | 10
2 | 13
2 | 14
3 | 11
3 | 15
4 | 21
4 | 22
4 | 23
4 | 24
4 | 25
5 | 183
5 | 184
5 | 185
6 | 200
Any help to get this done is much appreciated. Thanks
Probably not the most efficient solution but you could use graphs to do this:
# sort each pair within its row and remove duplicate pairs
df <- as.data.frame(unique(t(apply(df, 1, sort))))
# load the library
library(igraph)
# build a graph from the edge list
graph <- graph.data.frame(df)
# decompose it into connected components
components <- decompose.graph(graph)
# get the vertices of each subgraph
result <- lapply(seq_along(components), function(i) {
  vertex <- as.numeric(V(components[[i]])$name)
  cbind(rep(i, length(vertex)), vertex)
})
# make the final data frame
output <- as.data.frame(do.call(rbind, result))
colnames(output) <- c("group", "sno")
output
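A slightly shorter variant of the same idea (a sketch, assuming the deduplicated df from above): igraph's components() returns the group membership of every vertex directly, without decomposing into subgraphs. The group numbers may come out in a different order, but the grouping is the same.

library(igraph)
g <- graph.data.frame(df)
memb <- components(g)$membership
output2 <- data.frame(group = memb, sno = as.numeric(names(memb)))
output2 <- output2[order(output2$group, output2$sno), ]
output2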