R - compare multiple columns and create new columns indicating matches - r

I'd like to know how I can compare multiple columns to the values in a single column, then use those matches to create a table of differences. I have a political dataset of policy outcomes, and whether certain organizations supported or opposed those outcomes, by year. Here's some mock data:
Outcome 0 means the law never happened, outcome 1 means it happened.
For organizations, a negative number means they opposed the law and positive means they supported it:
set.seed(123)
Data <- data.frame(
year = sample(1998:2004, 200, replace = TRUE),
outcome = sample(0:1, 200, replace = TRUE),
union = sample(-1:1, 200, replace = TRUE),
chamber = sample(-1:1, 200, replace = TRUE),
pharma = sample(-1:1, 200, replace = TRUE),
gun = sample(-1:1, 200, replace = TRUE),
dem = sample(-1:1, 200, replace = TRUE),
repub = sample(-1:1, 200, replace = TRUE)
)
I would like to know how many times an organization matched the support or opposition of the union, per year.
I imagine it's going to be some table like this, where a match equals 1 and otherwise -1 (there are also many NAs in the data where organizations take no position):
Data$contra <- ifelse(Data$union == Data$chamber, 1, -1)
In the dataset, there's about 50 organizations in consecutive columns. It seems unwieldy to create 50 new columns, one for each match. Even if that is the best way to do it, I don't know how to apply the function to create 50 new columns.
Eventually, I'd like to create a heatmap or a way to visualize which organizations match the union column. But, first, I think, I need some kind of table of data.
Thanks for your help!

When you say "I would like to know how many times an organization matched the support or opposition of the union, per year." I assume you want the net agreement: the number of times a 1/1 or -1/-1 pairing occurred, minus the number of disagreements, ignoring the cases where either vote was 0.
Before running your code I used set.seed(123) so there could be reproducibility:
> head(Data)
year outcome union chamber pharma gun dem repub
1 2000 0 1 -1 0 -1 1 -1
2 2003 1 -1 1 0 0 1 -1
3 2000 1 1 -1 -1 -1 0 -1
4 2004 1 0 -1 -1 1 1 0
5 2004 0 0 -1 -1 1 0 -1
6 1998 1 0 1 1 0 1 1
> head( Data[-(1:3)] * Data[[3]])
chamber pharma gun dem repub
1 -1 0 -1 1 -1
2 -1 0 0 -1 1
3 -1 -1 -1 0 -1
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
This makes every 1/1 and -1/-1 pairing equal 1, every -1/1 and 1/-1 pairing equal -1, and every other pairing equal 0. Now one can aggregate this by year:
> head( aggregate( Data[-(1:3)] * Data[[3]], Data[1], sum) )
year chamber pharma gun dem repub
1 1998 0 -2 1 2 6
2 1999 0 0 2 4 3
3 2000 -3 2 -3 -4 -11
4 2001 2 3 2 9 1
5 2002 0 -1 7 9 1
6 2003 0 -2 -11 5 -2
If instead you wanted only the count of agreements, it would be:
> aggregate( Data[-(1:3)] * Data[[3]], Data[1], function(x) {sum(x==1)} )
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
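The question mentions that the real data contains many NAs. Here is a minimal sketch, assuming NAs should simply be skipped when counting. The small votes data frame is made up for illustration, and na.rm = TRUE is the only change from the call above.

```r
# Hypothetical toy data with missing positions (not the asker's data)
votes <- data.frame(
  year    = rep(2000:2001, each = 4),
  union   = c(1, -1, NA, 1, -1, -1, 1, 0),
  chamber = c(1,  1,  1, NA, -1,  1, 1, 0)
)

# Product is 1 for agreement, -1 for disagreement, 0 or NA otherwise
prod_union <- votes["chamber"] * votes$union

# Count agreements per year, ignoring NA products
agree <- aggregate(prod_union, votes["year"],
                   function(x) sum(x == 1, na.rm = TRUE))
print(agree)
```

Without na.rm = TRUE, any year containing an NA product would come back as NA.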

Using dplyr
library(dplyr)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union)
You get:
Source: local data frame [7 x 6]
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
Using gather() from tidyr to get data in a tall format and ggvis heatmap
library(dplyr)
library(tidyr)
library(ggvis)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union) %>%
gather(org, value, -year) %>%
mutate(org = as.factor(org), year = as.factor(year)) %>%
ggvis(~year, ~org, fill=~value) %>%
layer_rects(width = band(), height = band()) %>%
layer_text(
x = prop("x", ~year, scale = "xcenter"),
y = prop("y", ~org, scale = "ycenter"),
text:=~value, fontSize := 14, fill:="white",
baseline:="middle", align:="center") %>%
scale_nominal("x", padding = 0, points = FALSE) %>%
scale_nominal("y", padding = 0, points = FALSE) %>%
scale_nominal("x", name = "xcenter", padding = 1, points = TRUE) %>%
scale_nominal("y", name = "ycenter", padding = 1, points = TRUE) %>%
hide_legend("fill")

Maybe the following helps. First, you create a new data frame that contains for each organisation and each row whether the support matched the union:
match.union <- data.frame(year=Data$year,
lapply(Data[,4:ncol(Data)],function(col) col==Data$union))
It is important to add the column with the year for the next step, which is to sum up the number of agreements with the union per year:
aggregate(.~year,match.union,sum)
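The output I get from this is shown below. Note that col==Data$union is also TRUE when both sides took no position (0 == 0), so these counts include shared neutral rows: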
year chamber pharma gun dem repub
1 1998 11 9 10 9 7
2 1999 10 8 16 9 14
3 2000 8 9 8 7 12
4 2001 7 9 10 9 13
5 2002 11 12 11 13 8
6 2003 5 7 8 5 6
7 2004 13 13 15 15 10
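If shared neutral positions (0 and 0) should not count as a match, the comparison can be restricted to non-zero values. This is a sketch under that assumption; orgs, agree and the two result names are introduced here for illustration. It also checks that the result equals the product-based count of agreements.

```r
set.seed(123)
Data <- data.frame(
  year    = sample(1998:2004, 200, replace = TRUE),
  outcome = sample(0:1, 200, replace = TRUE),
  union   = sample(-1:1, 200, replace = TRUE),
  chamber = sample(-1:1, 200, replace = TRUE),
  pharma  = sample(-1:1, 200, replace = TRUE),
  gun     = sample(-1:1, 200, replace = TRUE),
  dem     = sample(-1:1, 200, replace = TRUE),
  repub   = sample(-1:1, 200, replace = TRUE)
)

# All organization columns to the right of union
orgs <- Data[, 4:ncol(Data)]

# TRUE only when the organization and the union share the same non-zero position
agree <- data.frame(
  year = Data$year,
  lapply(orgs, function(col) col == Data$union & col != 0)
)
agree_by_year <- aggregate(. ~ year, agree, sum)

# Product-based count of agreements, for comparison
prod_by_year <- aggregate(orgs * Data$union, Data["year"],
                          function(x) sum(x == 1))

print(all(agree_by_year[-1] == prod_by_year[-1]))
```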

Related

How to add a new column to a dataframe using values from the rows?

I'm trying to calculate a sum of multinomial probabilities. Think of this as elections: Suppose there are 11 voters and three candidates. Candidate A has a 0.5 probability of being chosen, B has 0.3 and C has 0.2. I am interested in calculating the probability that A wins.
So I am taking all the possible scenarios in which A wins under plurality vote (the option which has the most votes wins) and calculating the probability of them happening and then summing all these values.
My problem comes when I try to calculate the multinomial distribution of each individual scenario in which A wins.
I have a dataframe with all the possible outcomes in which A wins and ideally I want a new column in which the probability of each happening is shown. Sort of like this:
V1 V2 V3 dmultinom(x = c(5, 3, 3), prob = options)
1 5 2 4 0.06237
2 5 3 3 0.06237
3 5 4 2 0.06237
4 6 0 5 0.06237
5 6 1 4 0.06237
6 6 2 3 0.06237
7 6 3 2 0.06237
8 6 4 1 0.06237
9 6 5 0 0.06237
10 7 0 4 0.06237
11 7 1 3 0.06237
12 7 2 2 0.06237
13 7 3 1 0.06237
14 7 4 0 0.06237
15 8 0 3 0.06237
16 8 1 2 0.06237
17 8 2 1 0.06237
18 8 3 0 0.06237
19 9 0 2 0.06237
20 9 1 1 0.06237
21 9 2 0 0.06237
22 10 0 1 0.06237
23 10 1 0 0.06237
24 11 0 0 0.06237
But with the right values, of course.
I tried to access the values of the rows using $ but with no success. I also tried to create a new column with the values of the rows as vectors using dplyr but couldn't do it either.
Here's an option with data.table
library(data.table)
#create data.frame
xx <- data.frame(V1 = c(5, 5, 5, 6),
V2 = c(2, 3, 4, 0),
V3 = c(4, 3, 2, 5))
#convert the data.frame to a data.table
setDT(xx)
#put the data in long format
xx <- data.table::melt(xx,
measure.vars = names(xx))
#make a grouping variable
xx[, group := rep(1:4, 3)]
#apply function to each group
xx[, probability := dmultinom(value, prob = c(0.5, 0.3, 0.2)), by = "group"]
#pivot data back to wider format
yy <- data.table::dcast(xx[, !c("group")],
probability ~ variable,
value.var = "value")
> yy
probability V1 V2 V3
1: 0.00231000 6 0 5
2: 0.03118500 5 2 4
3: 0.06237000 5 3 3
4: 0.07016625 5 4 2
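Not the prettiest solution, but it can be accomplished with a for-loop (here dat is the data frame of outcomes and options is the vector of probabilities from the question).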
First, creating the empty column:
dat$multinom <- 0
Next, iterate through the dataframe adding the multinomial with the inputs from V1, V2 and V3
for (i in 1:nrow(dat)) {
dat$multinom[i] <- dmultinom(x = c(dat$V1[i], dat$V2[i], dat$V3[i]), prob = options)
}
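The loop can also be written as a single row-wise apply() call. This is a sketch rather than the asker's exact setup: dat holds only four of the winning scenarios, and the probability vector is called p here to avoid shadowing the base function options().

```r
# Hypothetical subset of the winning scenarios from the question
dat <- data.frame(V1 = c(5, 5, 5, 6),
                  V2 = c(2, 3, 4, 0),
                  V3 = c(4, 3, 2, 5))
p <- c(0.5, 0.3, 0.2)  # probabilities for candidates A, B and C

# Each row is passed to dmultinom() as the count vector x
dat$multinom <- apply(dat[, c("V1", "V2", "V3")], 1, dmultinom, prob = p)
print(dat)
```

With all winning scenarios in dat, sum(dat$multinom) would then give the probability that A wins.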

Difference Between Adjacent Column Pairs in SAS and r

I would like to take the difference between columns 1 & 2, 3 & 4, 5 & 6, 7 & 8, and so on.
I originally had 55 corresponding column pairs (110 columns total) and needed to get 55 difference columns. I ended up coding each column difference by hand, but I thought I could probably do this much more efficiently. Perhaps by the use of arrays in SAS. I would like to solve this problem in r as well.
Synthetic data is below and if anyone knows how to quickly generate sequential paired column names like var1_apple, var1_banana, var2_apple, var2_banana, var3_apple, var3_banana,..., in r (without just typing out a vector of column names) that would be very helpful as well.
Thank you!
## create a dataframe with random values of 1:10. ncols x nrows = 200
df <- data.frame(matrix(sample(1:10, 200, replace = TRUE), ncol = 20, nrow = 10))
EDIT -- added the "55 difference columns" part at the bottom.
Adjusting data to be column pairs:
df <- data.frame(matrix(sample(1:10, 200, replace = TRUE), ncol = 20, nrow = 10))
names(df) <- paste0("var", rep(1:10, each = 2), "_", rep(c("apple", "banana")))
names(df)
[1] "var1_apple" "var1_banana" "var2_apple" "var2_banana" "var3_apple" "var3_banana"
[7] "var4_apple" "var4_banana" "var5_apple" "var5_banana" "var6_apple" "var6_banana"
[13] "var7_apple" "var7_banana" "var8_apple" "var8_banana" "var9_apple" "var9_banana"
[19] "var10_apple" "var10_banana"
library(tidyverse)
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row, names_to = c("var", ".value"), names_sep = "_")
# A tibble: 100 × 4
row var apple banana
<int> <chr> <int> <int>
1 1 var1 8 7
2 1 var2 4 9
3 1 var3 7 3
4 1 var4 6 10
5 1 var5 10 10
6 1 var6 1 1
7 1 var7 2 10
8 1 var8 7 9
9 1 var9 3 8
10 1 var10 2 6
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
Here is a variation to add all the difference columns interspersed:
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row, names_to = c("var", ".value"), names_sep = "_") %>%
mutate(difference = banana - apple) %>%
pivot_wider(names_from = var, values_from = apple:difference,
names_glue = "{var}_{.value}", names_vary = "slowest")
Result (truncated)
# A tibble: 10 × 10
row var1_apple var1_banana var1_difference var2_apple var2_banana var2_difference var3_apple var3_banana var3_difference
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 7 10 3 5 3 -2 1 9 8
2 2 9 2 -7 3 6 3 8 1 -7
3 3 2 10 8 3 3 0 7 8 1
4 4 3 1 -2 8 3 -5 9 9 0
5 5 2 7 5 7 10 3 6 9 3
6 6 5 4 -1 2 1 -1 5 4 -1
7 7 4 5 1 10 3 -7 9 4 -5
8 8 10 7 -3 3 2 -1 5 9 4
9 9 5 5 0 7 3 -4 10 7 -3
10 10 10 6 -4 1 4 3 10 10 0
Here is a SAS solution. As far as I understand, your data looks like this: I generated an example for 4 patients and 6 proteins, with two values for each protein (nasal and plasma).
data proteinvalues;
patient=1;
protein1_nas=12; protein1_plas=13; protein2_nas=6; protein2_plas=8;
protein3_nas=23; protein3_plas=24; protein4_nas=15; protein4_plas=15;
protein5_nas=45; protein5_plas=47; protein6_nas=56; protein6_plas=50;
output;
patient=2;
protein1_nas=1; protein1_plas=5; protein2_nas=6; protein2_plas=8;
protein3_nas=2; protein3_plas=4; protein4_nas=7; protein4_plas=9;
protein5_nas=3; protein5_plas=3; protein6_nas=8; protein6_plas=7;
output;
patient=3;
protein1_nas=0; protein1_plas=1; protein2_nas=20; protein2_plas=19;
protein3_nas=33; protein3_plas=5; protein4_nas=19; protein4_plas=20;
protein5_nas=32; protein5_plas=8; protein6_nas=12; protein6_plas=14;
output;
patient=4;
protein1_nas=4; protein1_plas=5; protein2_nas=9; protein2_plas=11;
protein3_nas=4; protein3_plas=6; protein4_nas=78; protein4_plas=70;
protein5_nas=4; protein5_plas=7; protein6_nas=78; protein6_plas=77;
output;
run;
OK, this data generating step is not quite elegant, but it works...
Here is my solution by using an array with all protein-variables as members. The value pairs are then adjacent members in the array, i.e. No.1/No.2 , No.3/No.4 etc...
/* Number of proteins, in your data 55, in my data 6 */
%let NUM_PROTEINS=6;
data result (keep=patient diff:);
set proteinvalues;
/* Array definition with all variable names starting with "protein" */
array protarr{*} protein:;
/* Array for the resulting values */
array diff(&NUM_PROTEINS.);
/* Initializing the protein number */
num_prot=0;
/* Loop over the protein by step 2 = one iteration per protein */
do n=1 to dim(protarr) by 2;
/* p is plasma, n is nasal value */
p=n+1;
/* setting the protein number */
num_prot+1;
/* calculate the difference, using sum-function to handle missing values */
diff{num_prot}=sum(protarr{p},-protarr{n});
end;
run;
I think @Tom's comment is spot-on. Restructuring the data probably makes sense if you are working with paired data. E.g.:
od <- names(df)[c(TRUE,FALSE)]
ev <- names(df)[c(FALSE,TRUE)]
data.frame(
odd = unlist(df[od]),
oddname = rep(od,each=nrow(df)),
even = unlist(df[ev]),
evenname = rep(ev,each=nrow(df))
)
## odd oddname even evenname
##X11 7 X1 10 X2
##X12 6 X1 1 X2
##X13 2 X1 6 X2
##X14 5 X1 2 X2
##X15 3 X1 1 X2
## ...
It is then trivial to subtract one column from the other in this structure.
If you must have the matrix-like output, then that is also achievable:
od <- names(df)[c(TRUE,FALSE)]
ev <- names(df)[c(FALSE,TRUE)]
setNames(df[od] - df[ev], paste(od, ev, sep="_"))
## X1_X2 X3_X4 X5_X6 X7_X8 X9_X10 X11_X12 X13_X14 X15_X16 X17_X18 X19_X20
##1 -3 2 4 4 -2 4 3 1 -3 9
##2 5 5 4 3 -1 3 -1 -3 5 -2
##3 -4 3 7 4 -5 1 1 5 -4 4
##4 3 0 6 3 4 -5 6 6 -7 4
##5 2 2 1 4 -6 -3 6 2 3 1
##6 -6 -2 4 -2 0 1 3 0 0 -7
##7 0 -6 3 7 -1 0 0 -5 3 1
##8 -1 3 3 1 2 -2 -5 3 0 0
##9 -4 1 -5 -2 -4 7 6 -2 4 -4
##10 2 -7 4 -1 0 -6 -4 -4 0 0
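The odd/even indexing above relies on column position. When the columns carry paired names such as var1_apple and var1_banana, the pairs can be selected by name instead. This is a sketch, assuming that naming scheme; the _difference suffix and the seed are choices made here for illustration.

```r
set.seed(1)
df <- data.frame(matrix(sample(1:10, 200, replace = TRUE), ncol = 20, nrow = 10))
names(df) <- paste0("var", rep(1:10, each = 2), "_", c("apple", "banana"))

# Pick each half of the pairs by suffix
apple  <- grep("_apple$",  names(df), value = TRUE)
banana <- grep("_banana$", names(df), value = TRUE)

# Guard: both halves must line up on the same prefixes, in the same order
stopifnot(identical(sub("_apple$", "", apple), sub("_banana$", "", banana)))

# One difference column per pair (banana minus apple)
diffs <- setNames(df[banana] - df[apple],
                  sub("_apple$", "_difference", apple))
print(head(diffs[1:3]))
```

The same code works unchanged for 55 pairs, as long as every _apple column has a matching _banana column.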

Data Frame- Add number of occurrences with a condition in R

I'm having a bit of a struggle trying to figure out how to do the following. I want to count how many consecutive days of high sales I had immediately before a price change. For example, I have a price change on day 10, and the high sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days before that change.
In this case it should return 5 (day 5 to 9)
For example purposes, the dataframe is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, though hopefully someone better than me can provide a more elegant answer - but to get the desired dataset, you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr) # provides %>%, mutate() and lag()
# running position within each consecutive run of high / non-high sales days
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# run position on the previous day
df <- df %>% mutate(lseq = lag(seq))
# length of the run that ends the day before the price change (5 here);
# assumes a single price change that is preceded by a high-sales day
keepz <- df[df$price_change != 0, "lseq"]
# last and first row of that run
end <- which(df$price_change != 0) - 1
start <- end - keepz + 1
df_want <- df[start:end, -c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
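If only the count is needed (5 in the example), it can be stored in a column without subsetting. This is a base R sketch, assuming the count should appear on the price-change row and be 0 elsewhere; streak and prev_streak are names introduced here.

```r
df <- data.frame(
  day          = 1:11,
  price        = c(rep(5, 9), 7, 7),
  price_change = c(rep(0, 9), 2, 0),
  sales        = c(12, 6, 5, 4, 10, 10, 10, 12, 14, 3, 2)
)
df$High_sales_ind <- as.integer(df$sales >= 10)

# streak[i] = number of consecutive high-sales days ending on day i
# (position within the current run, zeroed out on non-high days)
runs   <- rle(df$High_sales_ind)
streak <- sequence(runs$lengths) * df$High_sales_ind

# Streak as of the previous day; the first day has no previous day
prev_streak <- c(0, head(streak, -1))

# Report the streak only on days where the price changed
df$consec_high_days <- ifelse(df$price_change != 0, prev_streak, 0)
print(df)
```

This also handles several price changes in the same data frame, because each one looks only at the streak ending the day before it.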

Recreating a dataframe by using conditions from two different columns

I have a massive dataframe that looks like this:
df = data.frame(year = c(rep(1998,5),rep(1999,5)),
loc = c(10,rep(14,4),rep(10,2),rep(14,3)),
sitA = c(rep(0,3),1,1,0,1,0,1,1),
sitB = c(1,0,1,0,1,rep(0,4),1),
n = c(2,13,2,9,4,7,2,7,7,4))
df
year loc sitA sitB n
1 1998 10 0 1 2
2 1998 14 0 0 13
3 1998 14 0 1 2
4 1998 14 1 0 9
5 1998 14 1 1 4
6 1999 10 0 0 7
7 1999 10 1 0 2
8 1999 14 0 0 7
9 1999 14 1 0 7
10 1999 14 1 1 4
As you can see, there are years, localities, two different situations (denoted as sitA and sitB) and finally the counts of these records (column n).
I wanted to create a new data frame which reflects the counts for only year and localities where counts for situation A and B stored in the columns conditionally such as desired output below:
df.new
year loc sitB.0.sitA.0 sitB.0.sitA.1 sitB.1.sitA.0 sitB.1.sitA.1
1 1998 10 0 0 2 0
2 1998 14 13 9 2 4
3 1999 10 7 2 0 0
4 1999 14 7 7 0 4
The tricky part as you can realize is that the original dataframe doesn't include all of the conditions. It only has the ones where the count is above 0. So the new dataframe should have "0" for the missing conditions in the original dataframe. Therefore, well known functions such as melt (reshape) or aggregate failed to solve my issue. A little help would be appreciated.
A tidyverse method: we first append the column names to the values of the sit.. columns, then unite them into one column, and finally spread the values.
library(tidyverse)
df[3:4] <- lapply(names(df)[3:4], function(x) paste(x, df[, x], sep = "."))
df %>%
unite(key, sitA, sitB, sep = ".") %>%
spread(key, n, fill = 0)
# year loc sitA.0.sitB.0 sitA.0.sitB.1 sitA.1.sitB.0 sitA.1.sitB.1
#1 1998 10 0 2 0 0
#2 1998 14 13 2 9 4
#3 1999 10 7 0 2 0
#4 1999 14 7 0 7 4
If the position of the columns is not fixed you can use grep first
cols <- grep("^sit", names(df))
df[cols] <- lapply(names(df)[cols], function(x) paste(x, df[, x], sep = "."))
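For comparison, a base R route gets the zero-filled cells from xtabs(), which tabulates every combination of its factors. This is a sketch rather than a drop-in replacement: the key column and the Freq. prefix on the result columns come from this example, and year and loc come back as factors.

```r
df <- data.frame(year = c(rep(1998, 5), rep(1999, 5)),
                 loc  = c(10, rep(14, 4), rep(10, 2), rep(14, 3)),
                 sitA = c(rep(0, 3), 1, 1, 0, 1, 0, 1, 1),
                 sitB = c(1, 0, 1, 0, 1, rep(0, 4), 1),
                 n    = c(2, 13, 2, 9, 4, 7, 2, 7, 7, 4))

# One label per sitA/sitB combination
df$key <- paste0("sitA.", df$sitA, ".sitB.", df$sitB)

# Sum n over every year x loc x key cell; absent cells become 0
long <- as.data.frame(xtabs(n ~ year + loc + key, data = df))

# One row per year/loc, one column per combination
wide <- reshape(long, idvar = c("year", "loc"), timevar = "key",
                direction = "wide")
print(wide)
```

Note that xtabs() also creates rows for year/loc pairs that never occur in the data, which may or may not be wanted.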

r how to get total count of repeated values

I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count for number of persons (unique by person_id) with 1 study or 2 studies - so not those with particular value for study_id but:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc
How can I do this? I think maybe a count through loop but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this
dd <- data.frame(
person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
study_id = 1:11
)
Now I can count the number of people with a given number of studies:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
Here the top line is the number of studies, and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
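When each person/study pair should count once, the same result can be had by tabulating twice. This is a sketch, assuming duplicate rows should be dropped first; the two intermediate names are introduced here.

```r
dd <- data.frame(
  person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
  study_id  = 1:11
)

# Drop repeated person/study pairs so each study counts once per person
dd <- unique(dd)

# First table: number of studies per person
studies_per_person <- table(dd$person_id)

# Second table: number of persons per study count
persons_per_count <- table(studies_per_person)
print(persons_per_count)
```

The names of persons_per_count are the study counts and the values are the numbers of persons.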
You can use the aggregate function to get counts per user.
Then use it again to get counts per counts
i.e. assume your data is called "test"
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set your NA to be a number such as zero so they are not ignored i.e.
test$study_id[is.na(test$study_id)] = 0
Then you can run the same function but with a condition that the study_id has to be greater than zero
stg=setNames(
aggregate(
study_id~person_id,
data=test,function(x){sum(x>0)}),
c("person_id","num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts
setNames(
aggregate(
person_id~num_studies,
data=stg,length),
c("num_studies","num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr
library(dplyr)
tmp <- df %>%
group_by(person_id) %>%
summarise(num.studies = n()) %>%
group_by(num.studies) %>%
summarise(num.persons = n())
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> table(xtabs(dat))
# 10 11 12
# 3 2 1
df <- data.frame(
person_id = c(10,11,10,10,11,11,11),
study_id = c(1,2,3,4,5,5,1))
# remove replicated rows
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
#10 11
# 3 3
# number of people in each study
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 1
