Joining two datasets in R

I have two datasets that I want to combine:
Dataset_1:
id| value_1
1 | a
1 | b
1 | b
2 | a
2 | a
2 | b
...
Dataset_2:
id| value_2
1 | 123
1 | 433
1 | 234
2 | 222
2 | 333
2 | 333
...
and the result should look like:
id| value_1 | value_2
1 | a | 123
1 | b | 433
1 | b | 234
2 | a | 222
2 | a | 333
2 | b | 333
I tried to use these functions:
inner_join(dataset_1,dataset_2,by="id")
and
full_join(dataset_1,dataset_2,by="id")
and
merge(dataset_1,dataset_2,by="id")
but I always get all possible combinations of the two datasets rather than the row-by-row combination shown above.
It should be simple, but I can't figure out what I am doing wrong.
id is a double, value_1 is a chr and value_2 is an int.
Thanks for any help!

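Your example calls for a bind, not a join:
Dataset_3 <- bind_cols(Dataset_1, Dataset_2[-1])
What is happening:
When a join finds a repeated id, it returns one row for every combination of matching rows, so three rows with id 1 in each dataset produce nine rows for that id. bind_cols instead pairs rows purely by position, so both datasets must have the same number of rows in the same order.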
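If the rows are not guaranteed to line up by position, one alternative is to number the rows within each id and join on both columns. The following is a minimal base R sketch using the example data from the question; the helper column row_in_id is a name invented here for illustration and is not part of the original post.

```r
dataset_1 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                        value_1 = c("a", "b", "b", "a", "a", "b"),
                        stringsAsFactors = FALSE)
dataset_2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                        value_2 = c(123L, 433L, 234L, 222L, 333L, 333L))

# Joining on id alone pairs every row of an id with every other row of that id
all_combinations <- merge(dataset_1, dataset_2, by = "id")

# Number the rows within each id (row_in_id is a hypothetical helper column)
dataset_1$row_in_id <- ave(seq_along(dataset_1$id), dataset_1$id, FUN = seq_along)
dataset_2$row_in_id <- ave(seq_along(dataset_2$id), dataset_2$id, FUN = seq_along)

# Join on id plus the within-id row number, then restore order and drop the helper
combined <- merge(dataset_1, dataset_2, by = c("id", "row_in_id"))
combined <- combined[order(combined$id, combined$row_in_id), ]
combined$row_in_id <- NULL
```

Here all_combinations has 18 rows (3 x 3 for each id), while combined has the 6 rows the question asks for.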

Related

Subsetting a table in R

In R, I've created a 3-dimensional table from a dataset. The three variables are all factors and are labelled H, O, and S. This is the code I used to simply create the table:
attach(df)
test <- table(H, O, S)
Outputting the flattened table produces this table below. The two values of S were split up, so these are labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H | O | S1 | S2 |
+-----------+-----------+-----+-----+
| Isolation | Dead | 2 | 15 |
| | Sick | 64 | 20 |
| | Recovered | 153 | 379 |
| ICU | Dead | 0 | 15 |
| | Sick | 0 | 2 |
| | Recovered | 1 | 9 |
| Other | Dead | 7 | 133 |
| | Sick | 4 | 20 |
| | Recovered | 17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
| | Sick | Recovered |
+-----------+------+-----------+
| Isolation | 64 | 153 |
| ICU | 0 | 1 |
+-----------+------+-----------+
S = S1
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I have now found a much simpler method. All I needed to do was index each dimension of the table directly:
> test[1:2,2:3,1]
O
H Sick Recovered
Isolation 64 153
ICU 0 1
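Positions such as 1:2 and 2:3 depend on the order of the factor levels. A variant that may be more robust is to index by dimension names instead. The sketch below rebuilds the counts from the flattened table above with array(), which is an assumption made only so the example is self-contained; with the original data, test would come from table(H, O, S).

```r
# Counts copied from the flattened table; H varies fastest, then O, then S
counts <- array(
  c(2, 0, 7,   64, 0, 4,   153, 1, 17,     # S1: Dead, Sick, Recovered
    15, 15, 133,   20, 2, 20,   379, 9, 261),  # S2: Dead, Sick, Recovered
  dim = c(3, 3, 2),
  dimnames = list(H = c("Isolation", "ICU", "Other"),
                  O = c("Dead", "Sick", "Recovered"),
                  S = c("S1", "S2"))
)
test <- as.table(counts)

# Select levels by name along each dimension; fixing S drops that dimension
reduced <- test[c("Isolation", "ICU"), c("Sick", "Recovered"), "S1"]
```

Because the levels are named explicitly, the selection keeps working if the level order of H or O changes.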
Alternatively, subset the data before running table. For example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 3 0 1
# 4 0 8
# 5 1 1
# 6 3 0 2
# 4 2 2
# 5 1 0
# 8 3 12 0
# 4 0 0
# 5 2 0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 4 0 8
# 6 4 2 2

R - Convert a column to row headers and populate the presence of that header to a true/false for each record [duplicate]

I have a data frame that looks like this:
+-----------+------------+-----------+-----+----------------+
| Unique ID | First Name | Last Name | Age | Characteristic |
+-----------+------------+-----------+-----+----------------+
| 1 | Bob | Smith | 25 | Intelligent |
| 1 | Bob | Smith | 25 | Funny |
| 1 | Bob | Smith | 25 | Short |
| 2 | Jim | Murphy | 62 | Tall |
| 2 | Jim | Murphy | 62 | Funny |
| 3 | Kelly | Green | 33 | Tall |
+-----------+------------+-----------+-----+----------------+
I want to turn the values of the "Characteristic" column into column headers and, for each record, populate each new column with a 1 if that characteristic is present or a 0 if it isn't, so that I have only 1 row per record and my output looks like:
+-----------+------------+-----------+-----+-------------+-------+-------+------+
| Unique ID | First Name | Last Name | Age | Intelligent | Funny | Short | Tall |
+-----------+------------+-----------+-----+-------------+-------+-------+------+
| 1 | Bob | Smith | 25 | 1 | 1 | 1 | 0 |
| 2 | Jim | Murphy | 62 | 0 | 1 | 0 | 1 |
| 3 | Kelly | Green | 33 | 0 | 0 | 0 | 1 |
+-----------+------------+-----------+-----+-------------+-------+-------+------+
Here is the data in a more consumable form, along with a solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
read.table(header=TRUE, stringsAsFactors=FALSE, text="
Unique_ID First_Name Last_Name Age Characteristic
1 Bob Smith 25 Intelligent
1 Bob Smith 25 Funny
1 Bob Smith 25 Short
2 Jim Murphy 62 Tall
2 Jim Murphy 62 Funny
3 Kelly Green 33 Tall") %>%
mutate(v = 1L) %>%
tidyr::spread(Characteristic, v, fill=0L)
# Unique_ID First_Name Last_Name Age Funny Intelligent Short Tall
# 1 1 Bob Smith 25 1 1 1 0
# 2 2 Jim Murphy 62 1 0 0 1
# 3 3 Kelly Green 33 0 0 0 1
Most of the work is done by spread. Without the fill argument, spread would leave NA instead of 0 in all of the empty spots; fill=0L takes care of that. (Edited based on @www's suggestion.)
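In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A rough equivalent of the approach above might look like the sketch below, assuming dplyr and a recent tidyr are installed; the data frame is retyped from the question, and the column name present is an arbitrary choice made here.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Unique_ID = c(1, 1, 1, 2, 2, 3),
  First_Name = c("Bob", "Bob", "Bob", "Jim", "Jim", "Kelly"),
  Last_Name = c("Smith", "Smith", "Smith", "Murphy", "Murphy", "Green"),
  Age = c(25, 25, 25, 62, 62, 33),
  Characteristic = c("Intelligent", "Funny", "Short", "Tall", "Funny", "Tall"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(present = 1L) %>%                       # indicator to spread out
  pivot_wider(names_from = Characteristic,       # new column names
              values_from = present,             # cell values
              values_fill = list(present = 0L))  # 0 where a trait is absent
```

Unlike spread(), pivot_wider() creates the new columns in order of first appearance (Intelligent, Funny, Short, Tall) rather than alphabetically.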
Here is another tidyverse solution.
df %>%
mutate(ind = 1L) %>%
spread(Characteristic, ind, fill = 0L)
# Unique.ID First.Name Last.Name Age Funny Intelligent Short Tall
# 1 1 Bob Smith 25 1 1 1 0
# 2 2 Jim Murphy 62 1 0 0 1
# 3 3 Kelly Green 33 0 0 0 1
You can also use reshape2, which handles the case where a characteristic appears more than once for the same record by counting the occurrences.
library(reshape2)
dcast(df, ...~Characteristic, fun.aggregate = length)
The data
df <- read.table(text = "Unique ID | First Name | Last Name | Age | Characteristic
1 | Bob | Smith | 25 | Intelligent
1 | Bob | Smith | 25 | Funny
1 | Bob | Smith | 25 | Short
2 | Jim | Murphy | 62 | Tall
2 | Jim | Murphy | 62 | Funny
3 | Kelly | Green | 33 | Tall ", sep = "|", header = T, strip.white = T, stringsAsFactors = F)

R data.table add new column with query for each row

I have 2 data.tables in R like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that
first,last uniquely defines a person
that truncation of the last name does not introduce ambiguity.
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in first_table, I need to find a row in second_table that back-fills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
One approach is to join on first, then filter based on the substring-match
first_table[
unique(second_table[, .(first, last)])
, on = "first"
, nomatch = 0
][
substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or, do the truncation on the second_table to match the first (here to 5 characters, the length of trunc in the example), then join on both columns
first_table[
unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
, on = c("first", "trunc")
, nomatch = 0
]
## yields the same answer
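For a self-contained check, the first approach can also be written with merge() and startsWith(), which avoids hard-coding the truncation length. This is only a sketch, assuming the data.table package is installed; the tables are retyped from the question and the intermediate names candidates and joined are made up here.

```r
library(data.table)

first_table <- data.table(
  id = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1 = c(10, 20, 30, 40)
)
second_table <- data.table(
  id = 1:8,
  first = c("Bob", "Bob", "Sue", "Sue", "Sue", "Sue", "Bob", "Bob"),
  last = c("Smith", "Smith", "Goldman", "Goldman",
           "Wollworth", "Wollworth", "Bellbottom", "Bellbottom"),
  val2 = c("A", "B", "A", "B", "A", "B", "A", "B")
)

# One row per distinct person in second_table
candidates <- unique(second_table[, .(first, last)])

# Pair each first_table row with every candidate sharing the first name
joined <- merge(first_table, candidates, by = "first", allow.cartesian = TRUE)

# Keep only pairs where the truncated name is a prefix of the full last name
result <- joined[startsWith(last, trunc)][order(id)]
```

If a truncated name matched more than one last name, result would contain several rows for that id, so the second assumption in the question still matters.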

Why won't my column name change work in R?

This is part of a script I'm writing to merge the columns more fully after using merge().
If both datasets have a column with the same name, merge() gives you columns column.x and column.y. I have written a script to combine this data and to drop the unneeded columns (column.y and column.x_error, a column I've added to flag cases where dat$column.x != dat$column.y). I also want to rename column.x to column, to reduce manual work on my dataset. I have not managed to rename column.x to column; see the code for more info.
dat is obtained from dat = merge(data1,data2, by= "ID", all.x=TRUE)
#obtain a list of double columns
dubbelkol = cbind()
sorted = sort(names(dat))
for(i in as.numeric(1:length(names(dat)))) {
if(grepl(".x",sorted[i])){
if (grepl(".y", sorted[i+1]) && (sub(".x","",sorted[i])==sub(".y","",sorted[i+1]))){
dubbelkol = cbind(dubbelkol,sorted[i],sorted[i+1])
}
}
}
#Check data, fill in NA in column.x from column.y if poss
temp = cbind()
for (p in as.numeric(1:(length(dubbelkol)-1))){
if(grepl(".x",dubbelkol[p])){
dat[dubbelkol[p]][is.na(dat[dubbelkol[p]])] = dat[dubbelkol[p+1]][is.na(dat[dubbelkol[p]])]
temp = (dat[dubbelkol[p]] != dat[dubbelkol[p+1]])
colnames(temp) = (paste(dubbelkol[p],"_error", sep=""))
dat[colnames(temp)] = temp
}
}
#If every value in "column.x_error" is FALSE or NA, delete "column.y" and "column.x_error"
#Rename "column.x" to "column"
#from here until next comment everything works
droplist= c()
for (k in as.numeric(1:length(names(dat)))) {
if (grepl(".x_error",colnames(dat[k]))) {
if (all(dat[k]==FALSE, na.rm = TRUE)) {
droplist = c(droplist,colnames(dat[k]), sub(".x_error",".y",colnames(dat[k])))
#the next line doesn't work; it's supposed to rename the .x column to the bare name before the .y and .x_error columns are dropped.
colnames(dat[sub(".x_error",".x",colnames(dat[k]))])= paste(sub(".x_error","",colnames(dat[k])))
}
}
}
dat = dat[,!names(dat) %in% droplist]
paste(sub(".x_error","",colnames(dat[k]))) will give me "BNR" just fine, but the colnames(...) = ... won't change the column name in dat.
Any idea what's going wrong?
data1
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
data2
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
dat
+----+-------+-------+-----------+
| ID | BNR.x | BNR.y |BNR.x_error|
+----+-------+-------+-----------+
| 1 | 123 | NA |FALSE |
| 2 | 234 | 234 |FALSE |
| 3 | NA | 345 |FALSE |
| 4 | 456 | 456 |FALSE |
| 5 | 677 | 677 |FALSE |
| 6 | NA | NA |NA |
+----+-------+-------+-----------+
desired output
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
I suggest replacing:
sub(".x_error",".x",colnames(dat[k]))
with:
sub("\\.x_error", ".x", colnames(dat[k]))
if you want to match a literal dot. In a regex an unescaped . matches any character, so you have to escape it as \\. in the pattern (the replacement string needs no escaping).
Even better, since you are replacing . with ., why not just say:
sub("x_error", "x", colnames(dat[k]))
or, if there is no _error other than x_error, simply:
sub("_error", "", colnames(dat[k]))
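To see why the unescaped dot matters, here is a small sketch with a made-up column name, max.x, chosen because it contains an "x" that is not part of the suffix. None of these names come from the question.

```r
col <- "max.x"  # hypothetical column name for illustration

# Unescaped: "." matches any character, so the first match is "ax"
unescaped <- sub(".x", "", col)

# Escaped and anchored: only a literal ".x" at the end of the name is removed
escaped <- sub("\\.x$", "", col)

# fixed = TRUE treats the pattern as a plain string, no regex at all
literal <- sub(".x", "", col, fixed = TRUE)
```

With these inputs, unescaped is "m.x" while escaped and literal are both "max".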
Edit: The problem seems to be that your data format loads additional columns on the left and the right. You can select the columns you want first and then merge.
d1 <- read.table(textConnection("| ID | BNR |
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), sep = "|", header = TRUE, stringsAsFactors = FALSE)[,2:3]
d1$BNR <- as.numeric(d1$BNR)
d2 <- read.table(textConnection("| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), header = FALSE, sep = "|", stringsAsFactors = FALSE)[,2:3]
names(d2) <- c("ID", "BNR")
d2$BNR <- as.numeric(d2$BNR)
# > d1
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 NA
# 4 4 456
# 5 5 677
# 6 6 NA
# > d2
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 456
# 5 5 677
# 6 6 NA
dat <- merge(d1, d2, by="ID", all=TRUE)
# > dat
# ID BNR.x BNR.y
# 1 1 123 123
# 2 2 234 234
# 3 3 NA 345
# 4 4 456 456
# 5 5 677 677
# 6 6 NA NA
# replace all NA values in x from y
dat$BNR.x <- ifelse(is.na(dat$BNR.x), dat$BNR.y, dat$BNR.x)
# now remove y (NULL must be upper case; lower-case null is an undefined object)
dat$BNR.y <- NULL
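As for the renaming itself: as far as I can tell, colnames(dat[...]) <- value renames a temporary one-column copy and then writes that copy back into the existing column, so the name stored in dat is kept. Assigning into names(dat) directly avoids this. The sketch below uses a small made-up data frame with the same column names as the merged result.

```r
dat <- data.frame(ID = 1:3,
                  BNR.x = c(123, 234, 345),
                  BNR.y = c(123, 234, 345))

# Pattern from the question: the extracted copy is renamed, but writing it
# back into the existing "BNR.x" slot keeps the old name in dat
colnames(dat["BNR.x"]) <- "BNR"
names_after_attempt <- names(dat)

# Rename in place by assigning into names(dat) itself
names(dat)[names(dat) == "BNR.x"] <- "BNR"

# Drop the now redundant .y column
dat$BNR.y <- NULL
```

After the first assignment the names are still ID, BNR.x, BNR.y; after the last two lines dat has the columns ID and BNR.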

SQLite join selection from the same table as two columns

I have one table 'positions' with columns:
id | session_id | keyword_id | position
and some rows in it:
10 rows with session_id = 1
and 10 with session_id = 2.
As a result of the query I need a table like this:
id | keyword_id | position1 | position2
where 'position1' is a column with values that had session_id = 1 and 'position2' is a column with values that had session_id = 2.
The result set should contain 10 records.
Sorry for my bad English.
Data example:
id | session_id | keyword_id | position
1 | 1 | 1 | 2
2 | 1 | 2 | 3
3 | 1 | 3 | 0
4 | 1 | 4 | 18
5 | 2 | 5 | 9
6 | 2 | 1 | 0
7 | 2 | 2 | 14
8 | 2 | 3 | 2
9 | 2 | 4 | 8
10 | 2 | 5 | 19
Assuming that you wish to combine positions with the same keyword_id from the two sessions, the following query should do the trick:
SELECT T1.keyword_id
, T1.position as Position1
, T2.position as Position2
FROM positions T1
INNER JOIN positions T2
ON T1.keyword_id = T2.keyword_id -- this will match positions by [keyword_id]
AND T1.session_id = 1
AND T2.session_id = 2
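To make the effect of the self-join concrete, the same logic can be sketched in base R on the example data (R is used here only to stay consistent with the other examples on this page; it is not part of the SQLite answer). The names session_1 and session_2 are made up for the illustration.

```r
positions <- data.frame(
  id = 1:10,
  session_id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  keyword_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  position = c(2, 3, 0, 18, 9, 0, 14, 2, 8, 19)
)

# One subset per session plays the role of the aliases T1 and T2
session_1 <- positions[positions$session_id == 1, c("keyword_id", "position")]
session_2 <- positions[positions$session_id == 2, c("keyword_id", "position")]

# Inner join on keyword_id; the suffixes yield position1 and position2
result <- merge(session_1, session_2, by = "keyword_id", suffixes = c("1", "2"))
```

With the sample rows this gives four records, for keyword_id 1 to 4. Keyword 5 appears only in session 2 in the sample, so an inner join drops it.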
