Combining Dataframes in R

I have two data frames, A and B, which both contain the same columns.
A = person, place
1 , 2
1 , 3
B = person, place
2 , 4
2 , 3
I want to combine both tables into one.
NewTable = person, place
1 , 2
1 , 3
2 , 4
2 , 3
Any ideas how?
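One way to stack two data frames that share the same columns is base R's rbind(). This is a minimal sketch that rebuilds A and B from the example above; the numeric column types are an assumption, since the question does not state them.

```r
# Rebuild the two example tables (column types assumed numeric)
A <- data.frame(person = c(1, 1), place = c(2, 3))
B <- data.frame(person = c(2, 2), place = c(4, 3))

# rbind() appends the rows of B below the rows of A,
# matching columns by name
NewTable <- rbind(A, B)
NewTable
```

rbind() expects both data frames to have the same column names. If the columns differ, the rows cannot be stacked this way without first aligning them.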

Related

Combining several columns based on matching text in R

I ran a study in Qualtrics with 4 conditions. I'm only including 3 in the example below for ease. The resulting data looks something like this:
condition Q145 Q243 Q34 Q235 Q193 Q234 Q324 Q987 Q88
condition How a? How b? How c? How a? How b? How c? How a? How b? How c?
1 3 5 2
1 5 4 7
1 3 1 4
2 3 4 7
2 1 2 8
2 1 3 9
3 7 6 5
3 8 1 3
3 9 2 2
The questions in the 2nd row are longer and more complex in the actual dataset, but they are consistent across conditions. In this sample, I've tried to capture the consistency and the fact that the default variable names (all starting with Q) do not match up.
Ultimately, I would like a dataframe that looks like the following. I would like to consolidate all the responses to a single question into one column per question. (Then I will go in and rename the lengthy questions with more concise variable names and "tidy" the data.)
condition How a? How b? How c?
1 3 5 2
1 5 4 7
1 3 1 4
2 3 4 7
2 1 2 8
2 1 3 9
3 7 6 5
3 8 1 3
3 9 2 2
I'd appreciate any ideas for how to accomplish this.
library(tidyverse)
file = 'condition,Q145 ,Q243 ,Q34 ,Q235 ,Q193 ,Q234 ,Q324 ,Q987 ,Q88
condition,How a?,How b?,How c?,How a?,How b?,How c?,How a?,How b?,How c?
1 ,3 ,5 ,2 , , , , , ,
1 ,5 ,4 ,7 , , , , , ,
1 ,3 ,1 ,4 , , , , , ,
2 , , , ,3 ,4 ,7 , , ,
2 , , , ,1 ,2 ,8 , , ,
2 , , , ,1 ,3 ,9 , , ,
3 , , , , , , , 7 , 6 , 5
3 , , , , , , , 8 , 1 , 3
3 , , , , , , , 9 , 2 , 2'
# Read in just the data without the weird header situation
data <- read_csv(file, col_names = FALSE, skip = 2)
# Pull out the questions row and reshape into a dataframe to make the next part easy
questions <- gather(read_csv(file, col_names = FALSE, skip = 1, n_max = 1))
# Generate list of data frames (one df for each question)
split(questions, questions$value) %>%
# Then coalesce the columns
map_df(~do.call(coalesce, data[, .x$key]))
Gives the following result:
# A tibble: 9 x 4
condition `How a?` `How b?` `How c?`
<int> <int> <int> <int>
1 1 3 5 2
2 1 5 4 7
3 1 3 1 4
4 2 3 4 7
5 2 1 2 8
6 2 1 3 9
7 3 7 6 5
8 3 8 1 3
9 3 9 2 2
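The key step above is coalesce(), which takes the first non-missing value at each position across its arguments. A small sketch with made-up vectors (q_a, q_b and q_c are hypothetical stand-ins for three columns that hold the same question under different conditions):

```r
library(dplyr)

# Each respondent answered in exactly one of the three columns
q_a <- c(3, 5, NA, NA)
q_b <- c(NA, NA, 1, NA)
q_c <- c(NA, NA, NA, 9)

# Position by position, keep the first value that is not NA
merged <- coalesce(q_a, q_b, q_c)
merged
```

Because each row belongs to only one condition, at most one of the columns is filled in per row, so the order of the arguments does not matter here.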
Of course, if you intend to move to long format eventually, you might just do something like this:
data %>%
gather(key, answer, -X1) %>%
filter(!is.na(answer)) %>%
left_join(questions, by = 'key') %>%
select(condition = X1, question = value, answer)
Resulting in the following:
# A tibble: 27 x 3
condition question answer
<int> <chr> <int>
1 1 How a? 3
2 1 How a? 5
3 1 How a? 3
4 1 How b? 5
5 1 How b? 4
6 1 How b? 1
7 1 How c? 2
8 1 How c? 7
9 1 How c? 4
10 2 How a? 3
# ... with 17 more rows

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
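The loop works because match() returns the position of each value in the pooled vector of unique values, so the same string gets the same number in both columns. A short sketch for one pair of columns, using the L1.1 and L1.2 values from the example data (the names x, y and refs are only illustrative):

```r
x <- c("38V1", "36V1", "37Y1", "38V2")  # L1.1
y <- c("38V1", "38V2", "36V1", "36V2")  # L1.2

# Pool both columns, then keep each distinct value once
refs <- unique(c(x, y))

# Each value is replaced by its position in the pooled reference
id_x <- match(x, refs)
id_y <- match(y, refs)
id_x  # 1 2 3 4
id_y  # 1 4 2 5
```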
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
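# df1 is the example data above; convert it first so data.table's melt method is used
df_m <- melt(as.data.table(df1), id.vars = "IID")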
setDT(df_m)[, id := .GRP, by = .(gsub("(.*).","\\1", df_m$variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to repeated values of the variable s, while also taking into account the id associated with each row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first of the rows for which s = 3 (rows with other values of s stay as they are). The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried to use both duplicated() and which(), with the idea of then passing the result to subset(), but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's run on from one id to the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
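To see why this works: duplicated() on a data frame compares whole rows, so a row is flagged only when the same (id, s) pair has already appeared. A sketch that rebuilds dat from the question and spells out the intermediate logical vectors (dup and keep are illustrative names):

```r
dat <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
                  s  = c(2, 2, 1, 3, 3, 3, 3, 3, 2, 2))

# TRUE for every row whose (id, s) combination was seen earlier
dup <- duplicated(dat)

# Keep all rows where s is not 3, plus the first s == 3 row per id
keep <- dat$s != 3 | !dup
result <- dat[keep, ]
result
```

This assumes dat has only the columns id and s. With additional columns, duplicated(dat) would compare those too, so it may be safer to call duplicated(dat[c("id", "s")]) instead.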

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R similar to the one below. My real ’df’ data frame is much bigger than this one, but I have simplified things as much as possible to avoid confusion.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ (and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1' (the dimensions are id, variable, value)
you could just do
> dftab['1','a','3']
[1] 7
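The order of the indices follows the column order of the melted data: id, then variable, then value. As a rough base R sketch of the same idea without the reshape package, the columns can be stacked by hand before calling table(). Only columns a and b from the question are used here to keep it short, and the names long and tab are illustrative.

```r
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a  <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b  <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# Stack the value columns by hand (a base R stand-in for melt)
long <- data.frame(
  id       = rep(df$id, times = ncol(df) - 1),
  variable = rep(names(df)[-1], each = nrow(df)),
  value    = unlist(df[-1], use.names = FALSE)
)

# Three-way table with dimensions id x variable x value
tab <- table(long)

# Index by dimnames: group '1', column 'a', value '3'
tab["1", "a", "3"]
```

Indexing by character labels rather than by position avoids mixing up which dimension is which.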
A combination of tapply and apply can create the data you want:
tapply(df$id, df$id, function(x) apply(df[df$id == x[1], -1], 2, table))
However, when a group does not contain every value in some column, as with column a of group 1 (which has no 2s), the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
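The ragged result for group 1 arises because table() only reports values that actually occur. One possible workaround, assuming the set of possible values is known in advance (1 to 3 here), is to convert each column to a factor with fixed levels so that absent values are counted as zero:

```r
# Column a for id group 1, taken from the question's data
a1 <- c(3, 1, 3, 3, 1, 3, 3, 3, 3, 1)

# Plain table(): the value 2 never occurs, so it is left out
table(a1)

# Fixed levels: every value gets a cell, including zero counts
counts <- table(factor(a1, levels = 1:3))
counts
```

With fixed levels every column yields a table of the same length, so apply() can return a matrix for every group.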
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
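library(plyr)
# Tabulate every column except id for one subset of the data
ColTables <- function(df) {
  counts <- list()
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
# Apply ColTables to each id group; the result is a list keyed by id
results <- dlply(df, "id", ColTables)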
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i=1
newlist=list()
colnames(newlist)=colnames(book)
for ( i in 1:dim(rec)[1]) {
mrn=as.numeric(as.vector(rec$MRN[i]));
course=as.character(rec$COURSE[i]);
get.vector<-as.vector(((as.numeric(as.vector(book$MRN))==mrn) & (as.character(book$COURSE)==course)))
newlist[i]<-book[get.vector,]
i=i+1;
}
I would appreciate any suggestions on:
1) getting this to work
2) making it more elegant (or perhaps just less clumsy)
If I have been unclear in any way, I beg your pardon.
I do understand I haven't combined any data above. I think if I can generate a long-format data.frame I can combine them all on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
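A minimal sketch of how this behaves, using made-up stand-ins for rec and book (the visit and dose columns and all values are hypothetical, since the question does not show the real data):

```r
# Hypothetical data: several rec rows per MRN/COURSE pair,
# one book row per pair
rec <- data.frame(MRN    = c(101, 101, 102, 102),
                  COURSE = c("A", "A", "A", "B"),
                  visit  = 1:4)
book <- data.frame(MRN    = c(101, 102, 102),
                   COURSE = c("A", "A", "B"),
                   dose   = c(10, 20, 30))

# Each rec row picks up the book columns whose MRN and COURSE match
combined <- merge(rec, book, by = c("MRN", "COURSE"))
combined
```

By default merge() drops rec rows that have no match in book. Passing all.x = TRUE keeps them and fills the book columns with NA.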
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 onto each row of df2, according to variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
