I have two sets of data, each in a separate data frame. This is because one is derived from an Excel spreadsheet and the other by automatic iteration of raw data files.
Both data frames have one thing in common: a first column containing a uniform timestamp for each observation. df1 contains data on humidity and temperature (variables: timestamp, hum, temp), and df2 contains oxygen, power and time variables (variables: timestamp, O2, power, time).
Ideally, df1 should contain all timestamped observations that df2 contains. Additionally, df1 contains some extra observations that need to be removed.
I would like to "join" both data frames, such that for each timestamp the variable values from both data frames are combined (i.e. variables: timestamp, hum, temp, O2, power, time). Observations that occur only in df1 should be removed.
Is there any smart way of doing that?
Kind regards
kruemelprinz
Seems like you're just looking for a simple left_join. This can be done via dplyr with
left_join(df2, df1, by = "timestamp")
which keeps every row of df2 and attaches the matching df1 values. Since df2's timestamps are a subset of df1's, this drops all of the extra observations in df1.
A base R implementation is:
merge(x = df2, y = df1, by = "timestamp", all.x = TRUE)
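A minimal runnable sketch of both approaches, using made-up values for the variables described in the question:

```r
library(dplyr)

# Toy data: df2's timestamps are a subset of df1's (hypothetical values)
df1 <- data.frame(timestamp = c("t1", "t2", "t3"),
                  hum = c(40, 45, 50), temp = c(20, 21, 22))
df2 <- data.frame(timestamp = c("t1", "t2"),
                  O2 = c(19.5, 19.7), power = c(100, 110), time = c(1, 2))

# Keep every row of df2 and attach the matching df1 columns;
# the extra observation "t3" in df1 is dropped
joined <- left_join(df2, df1, by = "timestamp")
```

The base R merge call from the answer produces the same result here, since all.x = TRUE keeps exactly the rows of df2.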
I have two datasets, containing information on Airbnb listings, based on those listings' IDs. The first dataset, "calendar", includes for every ID and every date for 365 days ahead, the price and the availability of the listing. It has 4159641 rows and 4 columns.
The second data set, "Listings", includes for those same IDs several characteristics, like longitude, latitude, capacity, etc. It has 8903 rows and 9 variables.
Based on those common IDs, I would like to combine the two datasets so that all the information from the second dataset, "Listings", is added to the first, "calendar". More precisely, for every row of listing and price data I want to include the information about longitude, latitude, capacity, etc. The combined dataset would then have 4159641 rows and 12 columns.
I would be really grateful to anyone who helps me with that.
Thank you!
(screenshots of the calendar and Listings datasets)
You could try the following:
library(dplyr)
calendar <- read.csv2(...)
listings <- read.csv2(...)
joined_data <- inner_join(calendar, listings, by="ID")
The general usage is as follows:
join_type(first_data_set, second_data_set, by=column_to_join_on)
Be aware of the join_type:
inner_join will combine the first and second tables based on the join predicate, keeping only matching rows
left_join will take all of the rows from first_data_set and match them to records from second_data_set; where there is no match, NA values appear
right_join is the opposite of left_join
There are more, which you can check yourself in the package documentation. But the right one for you is probably either inner_join or left_join.
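A small sketch of the difference between the two candidate joins, using hypothetical miniature versions of the datasets:

```r
library(dplyr)

# Hypothetical miniature versions of the two datasets
calendar <- data.frame(ID = c(1, 1, 2, 3),
                       date = as.Date("2019-01-01") + c(0, 1, 0, 0),
                       price = c(50, 55, 80, 120))
listings <- data.frame(ID = c(1, 2),
                       longitude = c(4.9, 4.8), latitude = c(52.4, 52.3))

inner <- inner_join(calendar, listings, by = "ID")  # drops ID 3 (no match)
left  <- left_join(calendar, listings, by = "ID")   # keeps ID 3 with NA
```

If every calendar ID has an entry in listings, the two give identical results; they differ only for unmatched IDs.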
That's a left join since you want as many rows as there are observations in df1. Many ways to do that:
Base R
This also works with a data.table object (merge is extended for this class of objects)
merge(df1, df2, all.x = TRUE, by = 'ID')
dplyr
library(dplyr)
df1 %>% left_join(df2, by = 'ID')
I advise you to have a look at this post, where you can find discussions on other types of join (inner, right...)
Another option is data.table
library(data.table)
setDT(df2)[df1, on = .(ID)]
(in data.table syntax, X[Y] keeps every row of Y, so df2[df1] is the left join on df1)
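A runnable sketch of the data.table form with toy data (hypothetical IDs and values):

```r
library(data.table)

# Hypothetical data keyed on ID
df1 <- data.frame(ID = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(ID = c(1, 3), y = c(10, 30))

# In data.table, X[Y] keeps every row of Y, so df2[df1] is a
# left join on df1: all three IDs survive, y is NA where unmatched
res <- setDT(df2)[df1, on = .(ID)]
```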
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two data frames (unclear, because it appears you extracted them from a single larger data frame called data), I think it is better to merge them and work with a single data frame. That is: if there is one larger data frame, do not subset it, just delete the columns you do not need; if data1 and data2 are two data frames, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = "columnID") # where "columnID" is the name of the variable that identifies the ID; if it differs between data1 and data2 you can use by.x and by.y
Then you have to decide which rows to keep, using the parameters all.x, all.y, and all: all rows from data1 even if no match is found in data2 (all.x = TRUE), all rows from data2 even if no match is found in data1 (all.y = TRUE), or all rows regardless of whether there is a matching ID in the other table (all = TRUE).
Merge is in the base package with any installation of R.
You can also use dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good link for dplyr join https://rpubs.com/williamsurles/293454
Hope it helps
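Following the merge idea above, the drop-out flag itself can be built without lapply: `%in%` compares each 1984 id against the whole vector of 1985 ids, so differing lengths and orderings are not a problem. A sketch with toy ids standing in for the health-care panel:

```r
# Toy subsets standing in for subset(data, YEAR == 1984) and YEAR == 1985
data1 <- data.frame(id = c(1, 2, 8, 9), YEAR = 1984)
data2 <- data.frame(id = c(1, 3, 9),    YEAR = 1985)

# 1 = the respondent is absent in 1985 (dropped out), 0 = still present
data1$didtheydrop <- as.integer(!(data1$id %in% data2$id))
```

This appends the indicator directly to the 1984 data, ready to be used as the outcome variable for the model.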
I have a data frame concerning the purchases of a shop owner. They don't happen on a daily basis.
It has two columns: the first one describes the date, the second one the quantity bought in that date.
I would like to transform it into daily data, completing the original dataset, so I created a sequence:
a <- seq(as.Date("2013/11/19"), as.Date("2017/04/22"), "days")
The first date corresponds to the one of the first purchase, and the second one of the last one, of the original dataset.
The classes are both "Date".
How can I merge the two datasets by "date", even though they obviously have different numbers of rows? I would like to have a data frame with daily "Date" as the first column and "Quantity" as the second, with zeros where purchases didn't happen.
Using base R:
# create sample data frame with sales data
test <- data.frame(date = as.Date(c("2017/08/12", "2017/08/15", "2017/09/02")), quantity = c(3,2,1))
# create the date range
dates <- data.frame(date = seq(min(test$date), max(test$date), by = "day"))
# perform the left join
# (keeping all rows from "dates", and joining the sales dataset to them)
result <- merge(dates, test, by = "date", all.x = TRUE)
In the merge function, by names the column used to join the datasets (by.x and by.y exist for when the names differ), while all.x = TRUE tells merge that all rows from x (in this case dates) should be kept in the resulting data frame.
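One detail the question asks for is zeros, not NA, on days without purchases. Continuing the sample data from the answer, the NA values left by the join can be replaced in one step:

```r
# Sample sales data and full daily date range, as in the answer above
test <- data.frame(date = as.Date(c("2017/08/12", "2017/08/15", "2017/09/02")),
                   quantity = c(3, 2, 1))
dates <- data.frame(date = seq(min(test$date), max(test$date), by = "day"))

# Left join, then replace NA quantities with 0 for purchase-free days
result <- merge(dates, test, by = "date", all.x = TRUE)
result$quantity[is.na(result$quantity)] <- 0
```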
I'm new to R, and I'm trying to join two tables. The shared field between the two tables is the date, but when I import the data it arrives with a different structure.
First Table:
Second Table:
Actually, what I need is to join the data by operating system and remove Linux, like an inner join in SQL with a condition on the operating system. Thanks
Say that your first dataset is called df1 and the second one df2, you can join the two by calling:
merge(df1, df2, by = "operatingSystem")
You can specify the kinds of join by using all = T, all.x = T, or all.y = T.
I am a bit lazy to reproduce your example but I will give it a go as is
First, in your second table, you need to convert the date column to an actual date
You can do this with easily with lubridate
Assuming df1 and df2 for the first and second table respectively
library(lubridate)
df2$date <- ymd(df2$date) #ymd function assumes `year` then `month` then `day` when converting
Then you can use dplyr's semi_join to perform the desired join
from stat545
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
library(dplyr)
semi_join(df1, df2, by = c("date", "operatingSystem"))
This will keep all rows in df1 that have a match in df2 (Linux stays out), together with df1's columns such as newusers. If you also want df2's users column in the result, use inner_join instead, which keeps the columns from both tables (a clashing column name gets a suffix, e.g. users.1).
Note: you might need to convert df1$date to a date object with lubridate::date(df1$date)
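A runnable sketch of the whole recipe, with hypothetical tables whose column names follow the answer above (date, operatingSystem, newusers, users):

```r
library(dplyr)
library(lubridate)

# Hypothetical versions of the two tables
df1 <- data.frame(date = as.Date(c("2019-03-01", "2019-03-01")),
                  operatingSystem = c("Windows", "Linux"),
                  newusers = c(5, 7))
df2 <- data.frame(date = "20190301",       # date stored with a different structure
                  operatingSystem = "Windows",
                  users = 100)

# ymd() parses the compact string into a proper Date
df2$date <- ymd(df2$date)

# Keep rows of df1 that have a match in df2; Linux drops out
kept <- semi_join(df1, df2, by = c("date", "operatingSystem"))
```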
I have two structural identical dataframes: column id-part1, column id-part2 and column data1.
id-part1 and id-part2 are together used as an index.
Now I want to calculate the difference between the two dataframes of column data1 with respect to the two id columns. In fact, in one data-frames it might happen that the combination of id-part1 and id-part2 is not existing...
So it is somehow an SQL join operation, isn't it?
The merge() function is what you are looking for.
It works similar as an SQL join operation. Given your description a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your corresponding data frames. merge() uses x and y to reference these data frames where x is the first (DF1) and y the second (DF2).
The by= argument defines the column names to match (you can even specify different names for each data frame via by.x and by.y).
all.x and all.y specify the kind of join you would like to perform, depending on which rows you want to keep.
The result is a new data frame with different columns for data1. You can then continue with your calculations.
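A sketch of the merge plus the difference calculation, with made-up id parts and values; note that merge suffixes the duplicated data1 column as data1.x / data1.y:

```r
# Hypothetical data frames with a two-column index
DF1 <- data.frame(`id-part1` = c(1, 1, 2), `id-part2` = c("a", "b", "a"),
                  data1 = c(10, 20, 30), check.names = FALSE)
DF2 <- data.frame(`id-part1` = c(1, 2), `id-part2` = c("a", "a"),
                  data1 = c(4, 5), check.names = FALSE)

# Full outer join: keep combinations present in either data frame
solution <- merge(DF1, DF2, by = c("id-part1", "id-part2"),
                  all.x = TRUE, all.y = TRUE)

# merge renamed the clashing columns to data1.x and data1.y;
# the difference is NA where a combination exists in only one frame
solution$diff <- solution$data1.x - solution$data1.y
```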