R: Matching based on multiple columns

Let's say I have the following data set, which acts as the key:
x y value
1 2 10
1 1 20
2 1 30
1 1 20
2 3 200
I have another data set with many columns, two of them being x and y. I want to create a column value by matching against the key, e.g.
x y value and other columns
1 1 20
2 1 30
2 3 300
I can only get match to work when matching on one column. How do I extend this to matching on multiple columns?

You can use merge, as @MrFlick suggested:
df.key <- data.frame(
  x=c(1,1,2,1,2),
  y=c(2,1,1,1,3),
  value=c(10,20,30,20,200))
##
df.add <- data.frame(
  x=c(1,2,2),
  y=c(1,1,3),
  value=c(20,30,300),
  a=rnorm(3),
  b=rpois(3,0))
##
> merge(
    x=df.key,
    y=df.add)
  x y value         a b
1 1 1    20 0.9246104 0
2 1 1    20 0.9246104 0
3 2 1    30 0.2685016 0
##
> merge(
    x=df.key,
    y=df.add,
    by=c("x","y"))
  x y value.x value.y          a b
1 1 1      20      20  0.9246104 0
2 1 1      20      20  0.9246104 0
3 2 1      30      30  0.2685016 0
4 2 3     200     300 -0.4174230 0
By default, merge joins on the intersection of column names, as in the first example (x, y, value); that is why the x=2, y=3 row is missing there, since its value differs between the two data frames (200 vs. 300). Alternatively, you can specify which columns to join on using by=, as in the second example, or get more specific with by.x= and/or by.y= when the column names differ. See ?merge.
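As a rough illustration of by.x= and by.y=, here is a minimal sketch in which the key columns carry different names in the two data frames. The data frames key and dat and the column names col_x, col_y and other are hypothetical and only serve the example.

```r
# Hypothetical data: the join columns are named differently in each frame.
key <- data.frame(x = c(1, 1, 2), y = c(2, 1, 1), value = c(10, 20, 30))
dat <- data.frame(col_x = c(1, 2), col_y = c(1, 1), other = c("a", "b"),
                  stringsAsFactors = FALSE)

# by.x names the join columns of the first argument, by.y those of the second;
# the pairs are matched up position by position.
res <- merge(dat, key, by.x = c("col_x", "col_y"), by.y = c("x", "y"))
print(res)
```

The result keeps the column names of the first data frame for the join columns.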
Edit:
The problem is that df.key contains two rows with x=1 and y=1, so the row in df.add with x=1, y=1 has to be duplicated in the join in order to preserve the rows of df.key. I'm not sure how to make this adjustment elegantly (e.g. by specifying certain arguments to merge), but here's one approach, which drops the duplicated key rows first:
R> merge(
x=df.key[!duplicated(df.key[,c(1:2)]),],
y=df.add)
x y value a b
1 1 1 20 -1.0185211 0
2 2 1 30 2.7507656 0
3 2 3 200 0.3986168 0
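If the goal is to stay close to match, one possible sketch is to paste the key columns into a single string and match on that. This assumes the second data frame (called df.new here, a hypothetical name) does not yet have a value column, and that the separator "_" never occurs inside the key values. Because match returns the first hit, duplicated key rows do not duplicate rows in the result, and the row order of df.new is preserved.

```r
df.key <- data.frame(
  x = c(1, 1, 2, 1, 2),
  y = c(2, 1, 1, 1, 3),
  value = c(10, 20, 30, 20, 200))

# Hypothetical target data frame without a value column yet.
df.new <- data.frame(
  x = c(1, 2, 2),
  y = c(1, 1, 3),
  a = c("p", "q", "r"),
  stringsAsFactors = FALSE)

# Build one composite key per row, then look it up with match().
idx <- match(paste(df.new$x, df.new$y, sep = "_"),
             paste(df.key$x, df.key$y, sep = "_"))
df.new$value <- df.key$value[idx]   # NA where no key row matches
print(df.new)
```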

Related

Use if-else function on data frame with multiple values

I have a data frame that contains multiple values in each spot, like this:
library(plyr)
ID<-c(1,1,1,2,2,2,2,3,3,4,4,4,5,6,6)
W<-c(29,72,32,33,34,44,42,78,32,42,18,26,10,34,39)
df1<-data.frame(ID, W)
df<-ddply(df1, .(ID), summarize,
          X=paste(unique(W),collapse=","))
ID X
1 1 29,72,32
2 2 33,34,44,42
3 3 78,32
4 4 42,18,26
5 5 10
6 6 34,39
I am trying to generate another column using an if-else function so that every ID that has an X value greater than 70 will show a 1, and all others will show a 0, like this:
ID X Y
1 1 29,72,32 1
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
This is the code that I tried:
df$Y <- ifelse(df$X>=70, 1, 0)
But it doesn't work; it only seems to put the first value of each spot through the function:
ID X Y
1 1 29,72,32 0
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
It worked fine on my one column that has only one value per spot. Is there a way to get the if-else function to evaluate every value in each spot and assign a 1 if any of them fits the statement?
Thank you, I'm sorry that I do not know a lot of R vocabulary yet.
Since 'X' is a string, we can split it at the commas to create a list of vectors, loop over that list with map_int, and check whether any of the values, converted to numeric, is greater than 70:
library(dplyr)
library(purrr)
df %>%
  mutate(Y = map_int(strsplit(X, ","), ~ +(any(as.numeric(.x) > 70))))
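The same idea can be sketched in base R without extra packages. The data frame below is rebuilt by hand from the question's output, and X is assumed to be a character column (not a factor).

```r
# The collapsed data frame from the question, rebuilt by hand.
df <- data.frame(
  ID = 1:6,
  X = c("29,72,32", "33,34,44,42", "78,32", "42,18,26", "10", "34,39"),
  stringsAsFactors = FALSE)

# Split each string at the commas, convert the pieces to numbers,
# and flag the row with 1 if any of them exceeds 70.
df$Y <- sapply(strsplit(df$X, ","),
               function(v) as.integer(any(as.numeric(v) > 70)))
print(df)
```

The original ifelse(df$X >= 70, 1, 0) compares the whole string with "70" character by character, which is why only the leading digits seemed to matter.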

Sort rows of data frame by shifting the rows so that the maximum value is on the top

I have a data frame like the one below, whose values need to be reordered.
Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000
I want the maximum value to go to the top while keeping the position of each value relative to the others, so that the new order looks like below.
Name Bin Value
a 1 1000
a 2 1
a 3 100
a 4 10
b 1 2000
b 2 20
b 3 2
b 4 200
It is not just a matter of bringing the maximum Value to the top: the whole sequence of Value needs to be shifted along with the maximum, so that, for example, 1 is directly below 1000 for Name a in both the old and the new data.frame.
Define a function that takes a vector and rotates it so that the maximum moves to the top and the values that came before the maximum wrap around to the bottom. Use ave to apply it to Value within each Name.
max2top <- function(x) {
  wx <- which.max(x) - 1
  if (wx == 0) x else c(tail(x, -wx), head(x, wx))
}
transform(DF, Value = ave(Value, Name, FUN = max2top))
giving
Name Bin Value
1 a 1 1000
2 a 2 1
3 a 3 100
4 a 4 10
5 b 1 2000
6 b 2 20
7 b 3 2
8 b 4 200
Note
The input in reproducible form:
Lines <- "Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
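To see the rotation in isolation, here is a small check of max2top on plain vectors. The two inputs are simply the Value columns of groups a and b from the question; the function is repeated so the snippet runs on its own.

```r
max2top <- function(x) {
  wx <- which.max(x) - 1            # number of values before the maximum
  if (wx == 0) x else c(tail(x, -wx), head(x, wx))
}

a <- max2top(c(10, 1000, 1, 100))   # group a from the question
b <- max2top(c(20, 2, 200, 2000))   # group b from the question
print(a)
print(b)
```

Note that which.max returns the first maximum, so with ties the rotation starts at the earliest one.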

How do I identify the first zero in a group of ordered columns?

I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time-invariant columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like to have this column be "years since 1989" and can re-name columns accordingly.
A more general version of the question: for a series of time-ordered columns, how do I identify the first column in which a given value occurs?
Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
Here is an option using max.col with rowSums (df1 here is the same data frame that is called data above):
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2
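For the "years since 1989" column the question asks about, a minimal sketch could look like the following. It assumes the total.* columns are consecutive years starting at 1989 and are in chronological order; the column name years_since_1989 is made up for the example.

```r
data <- data.frame(
  Name = c("a", "b", "c", "d"),
  total.89 = c(8, 1, 7, 2),
  total.90 = c(6, 2, 9, 0),
  total.91 = c(4, 4, 0, 0),
  total.92 = c(0, 9, 0, 0),
  stringsAsFactors = FALSE)

# Select the yearly columns by name rather than by position.
year_cols <- grep("^total\\.", names(data), value = TRUE)

# Index of the first zero within the yearly columns, NA if there is none.
first_zero <- apply(data[, year_cols], 1, match, x = 0)

# Column 1 is 1989, so subtracting 1 gives years since 1989.
data$years_since_1989 <- unname(first_zero) - 1
print(data)
```

Schools that never reach zero keep NA, which can serve as the censoring indicator in a survival model.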

R saving the output of table() into a data frame

I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
period
id calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
However, if I save it as a data frame 'df':
df<-data.frame(table(df))
the format of 'df' would be like
   id period Freq
1   1  calib    1
2   2  calib    2
3   3  calib    0
4   4  calib    1
5   1  first    2
6   2  first    0
7   3  first    0
8   4  first    1
9   1  valid    0
10  2  valid    2
11  3  valid    2
12  4  valid    1
How can I avoid this and save the first output, as it is, into a data frame?
More importantly, is there any way to get the same result using 'dcast'?
Would this help?
> data.frame(unclass(table(df)))
calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
To elaborate just a little bit. I've changed the ids in the example data.frame such that your ids are not 1:4, in order to prove that the ids are carried along into the table and are not a sequence of row counts.
id <- c(10,20,30,40,10,10,20,30,40,40,20,20)
period <- c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df <- data.frame(id,period)
Create the new data.frame in one of two ways. rengis's answer is fine for 2-column data frames that have the id column first. It won't work so well if your data frame has more than 2 columns, or if the columns are in a different order.
An alternative is to specify the columns and the column order for your table:
df3 <- data.frame(unclass(table(df$id, df$period)))
The id column is included in the new data.frame as row.names(df3). To add it as a new column:
df3$id <- row.names(df3)
df3
calib first valid id
10 1 2 0 10
20 2 0 2 20
30 0 0 2 30
40 1 1 1 40
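Another base R route, sketched here with the question's original data, is as.data.frame.matrix, which keeps the wide layout of a two-way table. For the dcast part of the question, the optional block below uses reshape2::dcast with fun.aggregate = length; it only runs if reshape2 is installed, and the names wide and wide2 are just illustrative.

```r
id <- c(1, 2, 3, 4, 1, 1, 2, 3, 4, 4, 2, 2)
period <- c("first", "calib", "valid", "valid", "calib", "first",
            "valid", "valid", "calib", "first", "calib", "valid")
df <- data.frame(id, period)

# Base R: convert the two-way table to a data frame without reshaping to long.
wide <- as.data.frame.matrix(table(df))
wide$id <- rownames(wide)
print(wide)

# Optional: the same counts via dcast, counting rows per id/period cell.
if (requireNamespace("reshape2", quietly = TRUE)) {
  wide2 <- reshape2::dcast(df, id ~ period,
                           value.var = "period", fun.aggregate = length)
  print(wide2)
}
```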

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames, "rec" and "book".
Each row in "book" needs to be pasted onto the end of several rows of "rec" according to two variables that match between them: "MRN" and "COURSE".
I have tried the following and variations thereon to no avail:
i=1
newlist=list()
colnames(newlist)=colnames(book)
for ( i in 1:dim(rec)[1]) {
mrn=as.numeric(as.vector(rec$MRN[i]));
course=as.character(rec$COURSE[i]);
get.vector<-as.vector(((as.numeric(as.vector(book$MRN))==mrn) & (as.character(book$COURSE)==course)))
newlist[i]<-book[get.vector,]
i=i+1;
}
If anyone has any suggestions on
1) getting this to work
2) making it more elegant (or perhaps just less clumsy)
I would be grateful. If I have been unclear in any way, I beg your pardon.
I do understand that I haven't combined any data above; I think if I can generate a long-format data.frame I can combine them all on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
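A small sketch of what that merge does, using made-up rec and book data (the columns dose and physician are hypothetical, as are all the values). all.x = TRUE is added here so that rows of rec without a matching book entry are kept with NA instead of being dropped; leave it out for a plain inner join.

```r
# Hypothetical stand-ins for the question's data frames.
rec <- data.frame(
  MRN = c(101, 101, 102, 103),
  COURSE = c("A", "A", "B", "A"),
  dose = c(1.5, 2.0, 1.0, 3.0),
  stringsAsFactors = FALSE)
book <- data.frame(
  MRN = c(101, 102, 103),
  COURSE = c("A", "B", "A"),
  physician = c("Smith", "Jones", "Lee"),
  stringsAsFactors = FALSE)

# Each book row is attached to every rec row with the same MRN and COURSE.
long <- merge(rec, book, by = c("MRN", "COURSE"), all.x = TRUE)
print(long)
```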
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 onto each row of df2, according to the variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
