I have a massive data frame that looks like this:
df = data.frame(year = c(rep(1998,5),rep(1999,5)),
loc = c(10,rep(14,4),rep(10,2),rep(14,3)),
sitA = c(rep(0,3),1,1,0,1,0,1,1),
sitB = c(1,0,1,0,1,rep(0,4),1),
n = c(2,13,2,9,4,7,2,7,7,4))
df
year loc sitA sitB n
1 1998 10 0 1 2
2 1998 14 0 0 13
3 1998 14 0 1 2
4 1998 14 1 0 9
5 1998 14 1 1 4
6 1999 10 0 0 7
7 1999 10 1 0 2
8 1999 14 0 0 7
9 1999 14 1 0 7
10 1999 14 1 1 4
As you can see, there are years, localities, two different situations (denoted as sitA and sitB) and finally the counts of these records (column n).
I want to create a new data frame with one row per year and locality, where the counts for each combination of situation A and B are stored in separate columns, as in the desired output below:
df.new
year loc sitB.0.sitA.0 sitB.0.sitA.1 sitB.1.sitA.0 sitB.1.sitA.1
1 1998 10 0 0 2 0
2 1998 14 13 9 2 4
3 1999 10 7 2 0 0
4 1999 14 7 7 0 4
The tricky part, as you may notice, is that the original data frame doesn't include all of the conditions. It only has the ones where the count is above 0, so the new data frame should have "0" for the conditions missing from the original. Because of this, well-known functions such as melt (reshape) or aggregate failed to solve my issue. A little help would be appreciated.
Here is a tidyverse method. We first prepend the column names to the values of the sit.. columns. Then we unite them into a single key column and finally spread the values, filling missing combinations with 0.
library(tidyverse)
df[3:4] <- lapply(names(df)[3:4], function(x) paste(x, df[, x], sep = "."))
df %>%
unite(key, sitA, sitB, sep = ".") %>%
spread(key, n, fill = 0)
# year loc sitA.0.sitB.0 sitA.0.sitB.1 sitA.1.sitB.0 sitA.1.sitB.1
#1 1998 10 0 2 0 0
#2 1998 14 13 2 9 4
#3 1999 10 7 0 2 0
#4 1999 14 7 0 7 4
If the position of the columns is not fixed you can use grep first
cols <- grep("^sit", names(df))
df[cols] <- lapply(names(df)[cols], function(x) paste(x, df[, x], sep = "."))
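For comparison, here is a minimal base R sketch of the same reshaping, in case you would rather not load the tidyverse. It is only one possible approach: the helper column `key` and the result name `wide` are names I made up for illustration, and it assumes every combination you want as a column appears at least once somewhere in the data.

```r
df <- data.frame(year = c(rep(1998, 5), rep(1999, 5)),
                 loc  = c(10, rep(14, 4), rep(10, 2), rep(14, 3)),
                 sitA = c(rep(0, 3), 1, 1, 0, 1, 0, 1, 1),
                 sitB = c(1, 0, 1, 0, 1, rep(0, 4), 1),
                 n    = c(2, 13, 2, 9, 4, 7, 2, 7, 7, 4))

# Build one label per sitB/sitA combination (`key` is an illustrative name)
df$key <- paste0("sitB.", df$sitB, ".sitA.", df$sitA)

# Long to wide: one row per year/loc, one column per key
wide <- reshape(df[c("year", "loc", "key", "n")],
                idvar = c("year", "loc"), timevar = "key",
                direction = "wide")

# Combinations absent for a given year/loc come out as NA; turn them into 0
wide[is.na(wide)] <- 0

# Drop the "n." prefix that reshape() adds, then sort the count columns
names(wide) <- sub("^n\\.", "", names(wide))
wide <- wide[c("year", "loc", sort(setdiff(names(wide), c("year", "loc"))))]
rownames(wide) <- NULL
wide
```

A combination that never occurs anywhere in the data would not get a column with this sketch, so it would have to be added by hand.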
I am trying to identify column names with matching substrings, and then calculate the differences of the values in those columns.
Sample data:
V1_ABC <- c(1,2,3,4)
V2_ABC <- c(2,3,4,5)
V1_WXYZ <- c(10,11,12,13)
V2_WXYZ <- c(11,12,13,14)
Date <- c(2001,2002,2003,2004)
So df looks like:
df <- data.frame(Date, V1_ABC, V2_ABC, V1_WXYZ, V2_WXYZ)
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ
1 2001 1 2 10 11
2 2002 2 3 11 12
3 2003 3 4 12 13
4 2004 4 5 13 14
I want to calculate V1 minus V2 for ABC and WXYZ. My original dataset is much larger, so I don't want to do this manually for each. I'd like to automate this so that R compares the column headers and finds which columns have the same ending substring (V1_ABC and V2_ABC, and V1_WXYZ and V2_WXYZ), then subtracts the V2_ from the V1_. Like this:
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
1 2001 1 2 10 11 -1 -1
2 2002 2 3 11 12 -1 -1
3 2003 3 4 12 13 -1 -1
4 2004 4 5 13 14 -1 -1
Most of the functions I have found such as grep or intersect either look for a specific string you input, or return the values where the vectors are the same.
Any ideas on how to automate pairing based on names/substrings?
You could stack V1 and V2 separately, calculate the differences, and reshape them back to the wide form. This approach can deal with any numbers of pairs of V1_xxx and V2_xxx.
library(tidyverse)
df %>%
pivot_longer(contains("_"), names_to = c(".value", "grp"), names_sep = "_") %>%
mutate(dif = V1 - V2) %>%
pivot_wider(names_from = grp, values_from = c(V1, V2, dif))
# # A tibble: 4 × 7
# Date V1_ABC V1_WXYZ V2_ABC V2_WXYZ dif_ABC dif_WXYZ
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2001 1 10 2 11 -1 -1
# 2 2002 2 11 3 12 -1 -1
# 3 2003 3 12 4 13 -1 -1
# 4 2004 4 13 5 14 -1 -1
Here is a base R solution. You mention that your data frame is large so this checks for columns where there are exactly 2 shared suffixes and only operates on those. It assumes that they are all of the format "V1_suffix" and "V2_suffix" but could be easily modified if they are in other formats.
suffixes <- unlist(regmatches(names(df), gregexpr("_.+", names(df))))
# Limit to suffixes where there are 2
suffixes <- names(table(suffixes)[table(suffixes) == 2])
diffs <- sapply(suffixes,
\(suffix) df[[paste0("V1", suffix)]] - df[[paste0("V2", suffix)]]
)
diff_df <- data.frame(diffs) |>
setNames(paste0("dif", suffixes))
cbind(df, diff_df)
# Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
# 1 2001 1 2 10 11 -1 -1
# 2 2002 2 3 11 12 -1 -1
# 3 2003 3 4 12 13 -1 -1
# 4 2004 4 5 13 14 -1 -1
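If the prefixes are not always literally "V1" and "V2", the pairing can also be driven purely by the shared suffix. The following is a rough sketch under my own assumptions: everything after the first underscore is treated as the suffix, and within a pair the column whose name sorts first is the one subtracted from. The names `value_cols`, `groups` and `pair` are illustrative.

```r
df <- data.frame(Date = 2001:2004,
                 V1_ABC = 1:4, V2_ABC = 2:5,
                 V1_WXYZ = 10:13, V2_WXYZ = 11:14)

# Columns that carry a suffix, grouped by everything after the first "_"
value_cols <- grep("_", names(df), value = TRUE)
groups <- split(value_cols, sub("^[^_]*_", "", value_cols))

# Keep only suffixes shared by exactly two columns
groups <- Filter(function(g) length(g) == 2, groups)

# First column (in sorted order) minus the second, one new column per suffix
for (sfx in names(groups)) {
  pair <- sort(groups[[sfx]])
  df[[paste0("dif_", sfx)]] <- df[[pair[1]]] - df[[pair[2]]]
}
df
```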
I would like to find matching observations in two different datasets based on two variables.
The first dataset "df1" consists of the following two variables:
SessionID MarkerID
14 5
14 5
14 5
14 8
17 9
17 9
17 8
17 2
17 9
The other dataset "df2" consists of the same two variables:
SessionID MarkerID
14 5
17 8
17 2
Now, I would like to add another variable "Match" to df1 that shows if a match was found between the two datasets (Match = 1) or not (Match = 0) for an observation. The observation should have the same value for both the SessionID AND MarkerID.
The desired output looks as follows:
SessionID MarkerID Match
14 5 1
14 5 1
14 5 1
14 8 0
17 9 0
17 9 0
17 8 1
17 2 1
17 9 0
Reproducible example:
SessionID <- c(14,14,14,14,17,17,17,17,17)
MarkerID <- c(5,5,5,8,9,9,8,2,9)
df1 <- as.data.frame(cbind(SessionID, MarkerID))
SessionID <- c(14,17,17)
MarkerID <- c(5,8,2)
df2 <- as.data.frame(cbind(SessionID,MarkerID))
I have tried the following code but it did not produce the desired output:
df1$Match <- 0
df1$Match[which(df1$MarkerID == df2$MarkerID & df1$SessionID == df2$SessionID )] <- 1
Here is a possibility using match:
df1$Match <- ifelse(is.na(match(
paste(df1$SessionID, df1$MarkerID, sep = "_"),
paste(df2$SessionID, df2$MarkerID, sep = "_"))), 0, 1)
df1
# SessionID MarkerID Match
#1 14 5 1
#2 14 5 1
#3 14 5 1
#4 14 8 0
#5 17 9 0
#6 17 9 0
#7 17 8 1
#8 17 2 1
#9 17 9 0
Explanation: We concatenate SessionID and MarkerID entries in both data.frames and use match to identify matching rows; ifelse marks matching entries with 1 and NA (non-matching) entries with 0.
If you want to avoid ifelse you can also do
df1$Match <- as.numeric(!is.na(match(
paste(df1$SessionID, df1$MarkerID, sep = "_"),
paste(df2$SessionID, df2$MarkerID, sep = "_"))))
This works for me:
df1$Match <- as.numeric(do.call(paste, c(df1, sep = "-")) %in% do.call(paste, c(df2, sep = "-")))
You can use left_join from dplyr:
library(dplyr)
df1 %>%
left_join(df2 %>% mutate(Match = 1), by = c('SessionID', 'MarkerID')) %>%
mutate(Match = ifelse(is.na(Match), 0 , Match))
# SessionID MarkerID Match
# 1 14 5 1
# 2 14 5 1
# 3 14 5 1
# 4 14 8 0
# 5 17 9 0
# 6 17 9 0
# 7 17 8 1
# 8 17 2 1
# 9 17 9 0
I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:
box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
df1
box year rep5 rep7 rep2
1 5 1999 5 7 2
2 7 1999 5 7 2
3 2 1999 5 7 2
df2
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:
merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)
This gives the desired output:
box rep5 year rep7 rep2 distance
1 2 5 1999 7 2 200
2 5 5 1999 7 2 0
3 7 5 1999 7 2 100
However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:
reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3
When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining
reps <- c("rep7", "rep2")
When I run the same code using that definition, I get the result that df3 is NULL.
The output that I want (with the distance column renamed for clarity) is:
box year rep5 rep7 rep2 dist5 dist7 dist2
1 2 1999 5 7 2 200 300 0
2 5 1999 5 7 2 0 100 200
3 7 1999 5 7 2 100 0 300
What am I doing wrong? Any help you can give me would be very much appreciated!
My R life became so much easier when I learned about the libraries dplyr and tidyr, and the concept of tidy data sets. What you're trying to do above can be expressed as a pivot, and is pretty easy to do with dplyr and tidyr.
I'm assuming that what you really want is to turn df2:
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
into your output, with all those strange repetitions removed:
box year dist5 dist7 dist2
1 2 1999 200 300 0
2 5 1999 0 100 200
3 7 1999 100 0 300
So you should pivot box2 into columns, with your distance as the value. Using tidyr:
library(tidyr)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
# reshape it as desired
spread(df2, box2, distance,fill=0)
#Source: local data frame [4 x 7]
# box1 year2 2 4 5 7 9
#1 2 1999 0 0 200 300 300
#2 5 1999 200 0 0 100 0
#3 5 2000 0 0 0 0 0
#4 7 1999 300 200 100 0 0
My recommendation: learn to use dplyr and tidyr. It makes life so, so much easier.
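If you do need the dist columns attached to df1 as in the question, here is a rough base R sketch of the loop. Two things went wrong in the original attempt: `by.x` needs column names (strings), and a `for` loop itself returns NULL, so its result cannot be assigned to df3. Instead of merging repeatedly, this sketch looks the distances up with match() on a pasted key; `rep_cols`, `key1` and `key2` are names I chose for illustration, and it assumes each (box1, box2, year2) combination occurs at most once in df2.

```r
df1 <- data.frame(box = c(5, 7, 2), year = 1999, rep5 = 5, rep7 = 7, rep2 = 2)
df2 <- data.frame(
  box1     = c(5, 5, 5, 5, 7, 7, 7, 7, 2, 2, 2, 2),
  box2     = c(5, 7, 2, 5, 5, 7, 2, 4, 5, 7, 2, 9),
  year2    = c(1999, 1999, 1999, 2000, rep(1999, 8)),
  distance = c(0, 100, 200, 0, 100, 0, 300, 200, 200, 300, 0, 300))

# Names of the rep columns, fixed before the loop adds new columns
rep_cols <- grep("^rep", names(df1), value = TRUE)

# One lookup key per row of df2
key2 <- paste(df2$box1, df2$box2, df2$year2, sep = "_")

for (r in rep_cols) {
  # Matching key for this rep column, then pull the distance (NA if absent)
  key1 <- paste(df1$box, df1[[r]], df1$year, sep = "_")
  df1[[sub("^rep", "dist", r)]] <- df2$distance[match(key1, key2)]
}
df1
```

The rows stay in the original order of df1 rather than being sorted by box as merge would do.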
I'd like to know how I can compare multiple columns to the values in a single column, then use those matches to create a table of differences. I have a political dataset of policy outcomes, and whether certain organizations supported or opposed those outcomes, by year. Here's some mock data:
Outcome 0 means the law never happened, outcome 1 means it happened.
For organizations, a negative number means they opposed the law and positive means they supported it:
set.seed(123)
Data <- data.frame(
year = sample(1998:2004, 200, replace = TRUE),
outcome = sample(0:1, 200, replace = TRUE),
union = sample(-1:1, 200, replace = TRUE),
chamber = sample(-1:1, 200, replace = TRUE),
pharma = sample(-1:1, 200, replace = TRUE),
gun = sample(-1:1, 200, replace = TRUE),
dem = sample(-1:1, 200, replace = TRUE),
repub = sample(-1:1, 200, replace = TRUE)
)
I would like to know how many times an organization matched the support or opposition of the union, per year.
I imagine it's going to be some table like this, where a match equals 1 and otherwise -1 (there are also many NAs in the data where organizations take no position):
Data$contra <- ifelse(Data$union == Data$chamber, 1, -1)
In the dataset, there's about 50 organizations in consecutive columns. It seems unwieldy to create 50 new columns, one for each match. Even if that is the best way to do it, I don't know how to apply the function to create 50 new columns.
Eventually, I'd like to create a heatmap or a way to visualize which organizations match the union column. But, first, I think, I need some kind of table of data.
Thanks for your help!
When you say "I would like to know how many times an organization matched the support or opposition of the union, per year", I'm assuming that you want the net agreement: the number of times a 1/1 or -1/-1 vote pairing occurred, minus the number of disagreements, ignoring the cases where one of the votes was 0.
Before running your code I used set.seed(123) so there could be reproducibility:
> head(Data)
year outcome union chamber pharma gun dem repub
1 2000 0 1 -1 0 -1 1 -1
2 2003 1 -1 1 0 0 1 -1
3 2000 1 1 -1 -1 -1 0 -1
4 2004 1 0 -1 -1 1 1 0
5 2004 0 0 -1 -1 1 0 -1
6 1998 1 0 1 1 0 1 1
> head( Data[-(1:3)] * Data[[3]])
chamber pharma gun dem repub
1 -1 0 -1 1 -1
2 -1 0 0 -1 1
3 -1 -1 -1 0 -1
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
This makes 1/1 and -1/-1 pairings be all ==1 and -1/1 and 1/-1 pairings ==-1 and others ==0. Now one can aggregate this by year:
> head( aggregate( Data[-(1:3)] * Data[[3]], Data[1], sum) )
year chamber pharma gun dem repub
1 1998 0 -2 1 2 6
2 1999 0 0 2 4 3
3 2000 -3 2 -3 -4 -11
4 2001 2 3 2 9 1
5 2002 0 -1 7 9 1
6 2003 0 -2 -11 5 -2
If instead you only wanted the sum of only the agreements it would be:
> aggregate( Data[-(1:3)] * Data[[3]], Data[1], function(x) {sum(x==1)} )
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
Using dplyr
library(dplyr)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union)
You get:
Source: local data frame [7 x 6]
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
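mutate_each(), summarise_each() and funs() have been deprecated in dplyr for some time, so the pipeline above may warn or fail on a recent installation. Assuming a dplyr version that provides across(), an equivalent might look roughly like this (the result name `agree` is mine):

```r
library(dplyr)

set.seed(123)
Data <- data.frame(
  year    = sample(1998:2004, 200, replace = TRUE),
  outcome = sample(0:1, 200, replace = TRUE),
  union   = sample(-1:1, 200, replace = TRUE),
  chamber = sample(-1:1, 200, replace = TRUE),
  pharma  = sample(-1:1, 200, replace = TRUE),
  gun     = sample(-1:1, 200, replace = TRUE),
  dem     = sample(-1:1, 200, replace = TRUE),
  repub   = sample(-1:1, 200, replace = TRUE)
)

agree <- Data %>%
  select(-outcome) %>%
  # Multiply every organisation column by the union position
  mutate(across(-c(year, union), ~ .x * union)) %>%
  group_by(year) %>%
  # Count the rows where the product is 1, i.e. both sides took the same non-zero position
  summarise(across(-union, ~ sum(.x == 1)))
agree
```

The exact counts depend on the random draws, which can differ between R versions even with the same seed.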
Using gather() from tidyr to get the data into a tall format, then drawing a heatmap with ggvis:
library(dplyr)
library(tidyr)
library(ggvis)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union) %>%
gather(org, value, -year) %>%
mutate(org = as.factor(org), year = as.factor(year)) %>%
ggvis(~year, ~org, fill=~value) %>%
layer_rects(width = band(), height = band()) %>%
layer_text(
x = prop("x", ~year, scale = "xcenter"),
y = prop("y", ~org, scale = "ycenter"),
text:=~value, fontSize := 14, fill:="white",
baseline:="middle", align:="center") %>%
scale_nominal("x", padding = 0, points = FALSE) %>%
scale_nominal("y", padding = 0, points = FALSE) %>%
scale_nominal("x", name = "xcenter", padding = 1, points = TRUE) %>%
scale_nominal("y", name = "ycenter", padding = 1, points = TRUE) %>%
hide_legend("fill")
Maybe the following helps. First, you create a new data frame that contains for each organisation and each row whether the support matched the union:
match.union <- data.frame(year=Data$year,
lapply(Data[,4:ncol(Data)],function(col) col==Data$union))
It is important to add the column with the year for the next step, which is to sum up the number of agreements with the union per year:
aggregate(.~year,match.union,sum)
The output I get from this is
year chamber pharma gun dem repub
1 1998 11 9 10 9 7
2 1999 10 8 16 9 14
3 2000 8 9 8 7 12
4 2001 7 9 10 9 13
5 2002 11 12 11 13 8
6 2003 5 7 8 5 6
7 2004 13 13 15 15 10
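Note that these counts are larger than the ones obtained by multiplying the columns, because `col == Data$union` also counts rows where both sides are 0 (neither took a position). A tiny made-up example of the difference, with illustrative vectors `union_pos` and `org_pos`:

```r
union_pos <- c(1, -1, 0, 0, 1)
org_pos   <- c(1, -1, 0, 1, -1)

# Equality counts the 0/0 row as a match
n_equal <- sum(org_pos == union_pos)
n_equal

# The product is 1 only for 1/1 and -1/-1 pairings
n_same_side <- sum(org_pos * union_pos == 1)
n_same_side
```

Which of the two you want depends on whether "both took no position" should count as agreement.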
I would like to create a panel from a dataset that has one observation for every given time period such that every unit has a new observation for every time period. Using the following example:
id <- seq(1:4)
year <- c(2005, 2008, 2008, 2007)
y <- c(1,0,0,1)
frame <- data.frame(id, year, y)
frame
id year y
1 1 2005 1
2 2 2008 0
3 3 2008 0
4 4 2007 1
For each unique ID, I would like there to be a unique observation for the year 2005, 2006, 2007, and 2008 (the lower and upper time periods on this frame), and set the outcome y to 0 for all the times in which there isn't an existing observation, such that the new frame looks like:
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
....
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
I haven't had much success with loops. Any and all thoughts would be greatly appreciated.
1) reshape2 Create a grid g of all years and id values crossed and rbind it with frame.
Then using the reshape2 package cast frame from long to wide form and then melt it back to long form. Finally rearrange the rows and columns as desired.
The lines ending in one # are only to ensure that every year is present so if we knew that were the case those lines could be omitted. The line ending in ## is only to rearrange the rows and columns so if that did not matter that line could be omitted too.
library(reshape2)
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
wide <- dcast(frame, year ~ id, fill = 0, fun = sum, value.var = "y")
long <- melt(wide, id = "year", variable.name = "id", value.name = "y")
long <- long[order(long$id, long$year), c("id", "year", "y")] ##
giving:
> long
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
5 2 2005 0
6 2 2006 0
7 2 2007 0
8 2 2008 0
9 3 2005 0
10 3 2006 0
11 3 2007 0
12 3 2008 0
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
2) aggregate A shorter solution would be to run just the two lines that end with # above and then follow those with an aggregate as shown. This solution uses no addon packages.
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]
This gives the same answer as solution (1) except that, as noted by a commenter, solution (1) above makes id a factor whereas it is not a factor in this solution.
Using data.table:
require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
# id year y
# 1: 1 2005 1
# 2: 1 2006 0
# 3: 1 2007 0
# 4: 1 2008 0
# 5: 2 2005 0
# 6: 2 2006 0
# 7: 2 2007 0
# 8: 2 2008 0
# 9: 3 2005 0
# 10: 3 2006 0
# 11: 3 2007 0
# 12: 3 2008 0
# 13: 4 2005 0
# 14: 4 2006 0
# 15: 4 2007 1
# 16: 4 2008 0
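Maybe not an elegant solution, but anyway:
# use the full range of years; unique(year) alone would skip 2006, which never occurs in the data
df <- expand.grid(id = unique(frame$id), year = seq(min(frame$year), max(frame$year)))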
frame <- frame[frame$y != 0,]
df$y <- 0
df2 <- rbind(frame, df)
df2 <- df2[!duplicated(df2[,c("id", "year")]),]
df2 <- df2[order(df2$id, df2$year),]
rownames(df2) <- NULL
df2
# id year y
# 1 1 2005 1
# 2 1 2006 0
# 3 1 2007 0
# 4 1 2008 0
# 5 2 2005 0
# 6 2 2006 0
# 7 2 2007 0
# 8 2 2008 0
# 9 3 2005 0
# 10 3 2006 0
# 11 3 2007 0
# 12 3 2008 0
# 13 4 2005 0
# 14 4 2006 0
# 15 4 2007 1
# 16 4 2008 0
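Another way to think about all of the answers above: build the full id-by-year grid, attach the existing observations to it, and fill whatever is missing with 0. A minimal base R sketch of that idea using merge() follows; the names `grid` and `panel` are mine, and it assumes there is at most one observation per id and year.

```r
frame <- data.frame(id = 1:4,
                    year = c(2005, 2008, 2008, 2007),
                    y = c(1, 0, 0, 1))

# Every id crossed with every year between the first and last observed year
grid <- expand.grid(id = unique(frame$id),
                    year = seq(min(frame$year), max(frame$year)))

# Left join on id and year: combinations without an observation get y = NA
panel <- merge(grid, frame, all.x = TRUE)

# Missing observations become 0, then sort by unit and time
panel$y[is.na(panel$y)] <- 0
panel <- panel[order(panel$id, panel$year), ]
rownames(panel) <- NULL
panel
```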