I have two datasets containing information on Airbnb listings, keyed by the listings' IDs. The first dataset, "calendar", includes, for every ID and every date up to 365 days ahead, the price and availability of the listing. It has 4159641 rows and 4 columns.
The second dataset, "Listings", includes several characteristics for those same IDs, like longitude, latitude, capacity, etc. It has 8903 rows and 9 variables.
Based on those common IDs I would like to combine the two datasets, so that all the information in the second dataset, "Listings", is added to the first one, "calendar". More precisely, for every row of listing date and price I want to include the information about longitude, latitude, capacity, etc. The combined dataset would then have 4159641 rows and 12 columns.
I would be really grateful to anyone who helps me with that.
Thank you!
You could try the following:
library(dplyr)
calendar <- read.csv2(...)  # read in the two datasets (arguments omitted here)
listings <- read.csv2(...)
joined_data <- inner_join(calendar, listings, by = "ID")
The general usage is as follows:
join_type(first_data_set, second_data_set, by=column_to_join_on)
Be aware of the join_type:
inner_join will keep only the rows where the join predicate matches in both tables
left_join will keep all rows from first_data_set and match them to records from second_data_set; where there is no match, the new columns are filled with NA
right_join is the mirror image of left_join: it keeps all rows from second_data_set
...,
There are more; you can check them yourself in the dplyr documentation. For your case, the right one is probably inner_join or left_join; see the sketch below.
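A minimal sketch of the difference, with made-up toy data:
library(dplyr)
calendar <- data.frame(ID = c(1, 1, 2, 3), price = c(80, 85, 95, 120))
listings <- data.frame(ID = c(1, 2), capacity = c(2, 4))
inner_join(calendar, listings, by = "ID")  # drops ID 3 (no match in listings)
left_join(calendar, listings, by = "ID")   # keeps ID 3; capacity becomes NA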
That's a left join, since you want as many rows as there are observations in df1. There are many ways to do that:
Base R
This also works with a data.table object (merge is extended for this class of objects)
merge(df1, df2, all.x = TRUE, by = 'ID')
dplyr
library(dplyr)
df1 %>% left_join(df2, by = 'ID')
I advise you to have a look at this post, where you can find discussions of the other types of join (inner, right...)
Another option is data.table
library(data.table)
# in data.table, X[Y] keeps all rows of Y, so df1 goes on the right to get a left join
setDT(df2)[df1, on = .(ID)]
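A tiny reproducible example of the left-join behaviour (data made up):
df1 <- data.frame(ID = c(1, 2, 3), price = c(80, 95, 120))
df2 <- data.frame(ID = c(1, 2), capacity = c(2, 4))
merge(df1, df2, all.x = TRUE, by = 'ID')
# ID 3 is kept and gets NA for capacity, so the result has nrow(df1) rows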
I'm currently working in R on a survey of schools, and I would like to add a variable with the population of the city each school is in.
In the first dataset I have all the survey respondents, which includes a variable "city_name". I managed to find online a list of cities with their populations, which I have imported into R.
What I now would like to do is to add a variable in dataset_1 called city_pop which is equal to the city population when city_name is in both data sets. It might be relevant to know that the first dataset has around 1200 rows while the second one has around 36000 rows.
I've tried several things including the following:
data_set_1$Pop_city = ifelse(data_set_1$city_name == data_set_2$city_name, data_set_2$Pop_city, 0)
Any clues?
Thanks!!
You need to merge the two datasets:
new_df <- merge(data_set_1, data_set_2, by="city_name")
The result will be a dataframe containing only matching rows (in your case, 1200 rows assuming that all cities in data_set_1 are also in data_set_2) and all columns of both data frames. If you want to also keep non-matching rows of data_set_1, you can use the all.x option:
new_df <- merge(data_set_1, data_set_2, by="city_name", all.x=TRUE)
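For example, with made-up data (the names and numbers here are hypothetical):
data_set_1 <- data.frame(school = c("A", "B", "C"),
                         city_name = c("Lyon", "Nice", "Brest"))
data_set_2 <- data.frame(city_name = c("Lyon", "Nice"),
                         Pop_city = c(500000, 340000))
merge(data_set_1, data_set_2, by = "city_name", all.x = TRUE)
# Brest is kept but gets NA for Pop_city, since it has no match in data_set_2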
Two ways you could try using dplyr:
library(dplyr)
data_set_1 %>%
  mutate(Pop_city = ifelse(city_name %in% data_set_2$city_name,
                           # match() finds each city's row in data_set_2
                           data_set_2$Pop_city[match(city_name, data_set_2$city_name)],
                           0))
or using a left_join
data_set_1 %>%
left_join(data_set_2, by = "city_name")
perhaps followed by a select(all_of(names(data_set_1)), Pop_city) to drop the other columns of data_set_2.
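One difference from the ifelse approach: after a left_join, cities missing from data_set_2 get NA rather than 0. If you specifically want 0, you can add, for example:
library(dplyr)
data_set_1 %>%
  left_join(data_set_2, by = "city_name") %>%
  mutate(Pop_city = coalesce(Pop_city, 0))  # replace NA with 0 for unmatched cities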
I have a wide dataset which makes it really difficult to manipulate the data in the way I need. It looks like the dummy table below:
Dummy_table_unsorted
Essentially, as seen in the table, the information held in one row is at user level: you have a user id and then all the animals owned by that user in the same row. What I would like is to have this at animal level, so that a user can have multiple entries, one for each of their animals. I have pasted a table below of what I would like it to look like:
Dummy_table_sorted
Is there a simple way to do this? I have an idea as to how, but it is very long-winded: I thought I might subset the data into selected columns relating to one animal at a time and merge the subsets back together. The problem is that in my data one person can have up to 100 animals, which makes this approach very tedious.
Please can someone offer a suggestion or a package/command that would allow me to change this wide dataset into a long one?
Thank You.
First, you should provide data that someone can easily insert into R. Screenshots are not helpful and increase the amount of work a person needs to perform to help you.
The data as you have it can be split and recombined with bind_rows or rbind. I would subset the data into three dataframes, rename the columns, and bind them. Assuming your original data is called df:
library(dplyr)  # for bind_rows
# one data frame per animal slot, each keeping the user id column
df1 <- df[, c(1:4)]
df2 <- df[, c(1, 5:7)]
df3 <- df[, c(1, 8:10)]
# rename columns to match
names(df1) <- c('user id', 'animal', 'colour', 'legs')
names(df2) <- c('user id', 'animal', 'colour', 'legs')
names(df3) <- c('user id', 'animal', 'colour', 'legs')
remade <- bind_rows(df1, df2, df3)
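If your real data has many animal slots (up to 100, as you say), splitting by hand gets tedious. Assuming the columns follow a consistent naming pattern (the names animal_1, colour_1, legs_1, animal_2, ... here are hypothetical), tidyr's pivot_longer can do the same wide-to-long reshape in one call:
library(tidyr)
long <- pivot_longer(df,
                     cols = -`user id`,
                     names_to = c(".value", "slot"),
                     names_sep = "_")  # expects names like animal_1, colour_1, legs_1
Each row of the result is then one animal, with a slot column recording which position it came from.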
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two dataframes (unclear, because it appears you extracted them from an original larger dataframe called data), I think it is better to merge them and work with a single dataframe. That is: if there is a single larger dataframe, do not subset it, just drop the columns you do not need; if data1 and data2 are separate dataframes, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = "columnID") # "columnID" is the name of the ID variable; if it differs between data1 and data2, use by.x and by.y
Then you have to decide which rows to keep, with the parameters all.x, all.y, and all: all rows from data1 even if no match is found in data2, all rows from data2 even if no match is found in data1, or all rows regardless of whether there is a matching ID in the other table.
merge() is part of base R, so it is available in any installation.
You can also use dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good link for dplyr joins: https://rpubs.com/williamsurles/293454
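For your dropout question specifically, a left join makes the 1985 non-returners visible as NA, which you can turn into the indicator you want. A sketch, assuming id identifies the same respondent in both years:
library(dplyr)
data1 <- subset(data, YEAR == 1984)
data2 <- subset(data, YEAR == 1985)
data1 <- data1 %>%
  left_join(data2 %>% distinct(id) %>% mutate(in_1985 = 1), by = "id") %>%
  mutate(didtheydrop = ifelse(is.na(in_1985), 1L, 0L))
(The same flag can also be computed directly with as.integer(!(data1$id %in% data2$id)).)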
Hope it helps
I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2, but I need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several related topics), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I want (data1 now has 50 million observations, although I specify all.x=TRUE).
My code is:
#Remove duplicates before merge
data2 <- unique(data2)
#Merge
require(data.table)
data1 <- merge(data1, data2, by="ID", all.x=TRUE, allow.cartesian=TRUE)
Any advice on how to merge this, is very welcome.
In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns are being joined that have the same name as existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it will confuse a lot of functions; many functions silently call make.unique() to de-duplicate column names and avoid that confusion.
If you have duplicate ID columns in data1, fix them with colnames(data1) <- make.unique(colnames(data1)), which keeps the first occurrence as ID and renames the later ones to ID.1, ID.2, etc. Then do your merge (making sure to specify, e.g., by.x="ID.1", by.y="ID" if the column you want to join on was renamed). By default, duplicated non-key columns are given the suffixes .x and .y, although you can change these with the suffixes= option (see the merge help file for details).
Lastly, it's worth noting that the merge() function in the data.table package tends to be a lot faster than the base merge() function, with similar syntax. See page 47 of the data.table manual.
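A minimal sketch of the rename-then-merge idea (column names hypothetical; here the first ID column is assumed to be the join key):
# de-duplicate the column names: the second ID becomes ID.1, and so on
colnames(data1) <- make.unique(colnames(data1))
# left join on the chosen key; other shared column names get explicit suffixes
merged <- merge(data1, data2,
                by.x = "ID", by.y = "ID",
                all.x = TRUE,
                suffixes = c(".data1", ".data2"))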
I've run into a question I can't answer about conditionally merging 2 data frames. Let me describe the data frames (names changed):
The first, DF1, has a column called 'proceduredate' that contains the date of the procedure per instance (already formatted by as.Date in format %Y-%m-%d).
The second, DF2, has a variable called 'orderdate' that contains the date of each lab order (also formatted by as.Date in format %Y-%m-%d).
Each dataframe has an identifier (called 'id') for each individual that is used to merge "by" across the two dataframes. I would like to merge the dataframes conditionally to include only the DF2 instances that have an orderdate within 30 days of the proceduredate in DF1. As I understand it, this would look something like:
if (abs(DF1$proceduredate - DF2$orderdate) <= 30) {
  merge(DF1, DF2, by = "id")
}
However, I can't figure out a way to turn this idea into working code. Would you suggest any references or similar prior solutions?
SQL handles this better than (base) R - though I believe there's a way to do it in data.table.
library(sqldf)
result = sqldf("
  select *
  from DF1 left join DF2 on
    abs(DF1.proceduredate - DF2.orderdate) <= 30
    and DF1.id = DF2.id
")
I'm not sure this will work with your dates, maybe if they are Date class columns. If you provide a reproducible example I'm happy to test.
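For the data.table route mentioned above, a non-equi join can express the 30-day window directly. A sketch, assuming both date columns are Date class (nomatch = NULL drops DF1 rows with no qualifying order):
library(data.table)
setDT(DF1); setDT(DF2)
# window bounds around each procedure date
DF1[, `:=`(win_lo = proceduredate - 30, win_hi = proceduredate + 30)]
# for each DF1 row, pull the DF2 rows with the same id and an orderdate
# inside the window; x.orderdate recovers the actual order date
result <- DF2[DF1,
              .(id, proceduredate = i.proceduredate, orderdate = x.orderdate),
              on = .(id, orderdate >= win_lo, orderdate <= win_hi),
              nomatch = NULL]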