I have a data frame concerning the purchases of a shop owner. They don't happen on a daily basis.
It has two columns: the first holds the date, the second the quantity bought on that date.
I would like to transform it into daily data, completing the original dataset, so I created a sequence:
a <- seq(as.Date("2013/11/19"), as.Date("2017/04/22"), "days")
The first date corresponds to the one of the first purchase, and the second one of the last one, of the original dataset.
The classes are both "Date".
How can I merge the two datasets by "date" even though, obviously, they have different numbers of rows? I would like a data frame with the daily "Date" as the first column and "Quantity" as the second, with zeros where no purchase happened.
Using base R:
# create sample data frame with sales data
test <- data.frame(date = as.Date(c("2017/08/12", "2017/08/15", "2017/09/02")), quantity = c(3,2,1))
# create the date range
dates <- data.frame(date = seq(min(test$date), max(test$date), by = "day"))
# perform the left join
# (keeping all rows from "dates", and joining the sales dataset to them)
result <- merge(dates, test, by.y = "date", by.x = "date", all.x = TRUE)
In the merge function, by.x and by.y name the columns used to join the datasets, while all.x = TRUE says that all rows from x (in this case dates) should be kept in the resulting data frame. Dates without a purchase get NA in the quantity column.
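The question asks for zeros on days without a purchase, whereas a left join leaves those days unmatched. A minimal sketch of the extra step, reusing the sample data from the answer (the sample values are illustrative, not the asker's real data):

```r
# Sample sales data, same shape as in the answer above
test <- data.frame(date = as.Date(c("2017-08-12", "2017-08-15", "2017-09-02")),
                   quantity = c(3, 2, 1))

# One row per calendar day between the first and the last purchase
dates <- data.frame(date = seq(min(test$date), max(test$date), by = "day"))

# Left join: keep every day, attach the quantity where a purchase exists
result <- merge(dates, test, by = "date", all.x = TRUE)

# Days without a purchase come back as NA; replace them with 0
result$quantity[is.na(result$quantity)] <- 0
```

After this step, result has one row per day and quantity is 0 wherever no purchase was recorded.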
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two data frames (this is unclear, because it appears that you extracted them from a single larger data frame called data), I think it is better to merge them and work with a single data frame. That is, if there is a single larger data frame, do not subset it, just delete the columns you do not need; if data1 and data2 are two separate data frames, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading its documentation.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = columnID) # where columnID is the name (a string) of the column that identifies the ID; if it differs between data1 and data2, use by.x and by.y
Then decide which unmatched rows to keep with the arguments all.x, all.y and all: all.x = TRUE keeps every row of data1 even if no match is found in data2, all.y = TRUE keeps every row of data2 even if no match is found in data1, and all = TRUE keeps all rows of both regardless of whether there is a matching ID in the other data frame. By default, only rows with a match in both are returned.
merge() is part of base R, so it is available in any R installation.
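To make the effect of all.x, all.y and all concrete, here is a small sketch with two made-up data frames. The column names ID, a and b and all values are assumptions for illustration only:

```r
# Hypothetical data: IDs 1 and 2 occur in both, 3 only in data1, 4 only in data2
data1 <- data.frame(ID = c(1, 2, 3), a = c("x", "y", "z"))
data2 <- data.frame(ID = c(1, 2, 4), b = c(10, 20, 40))

inner <- merge(data1, data2, by = "ID")                # only IDs found in both
left  <- merge(data1, data2, by = "ID", all.x = TRUE)  # every row of data1; b is NA for ID 3
right <- merge(data1, data2, by = "ID", all.y = TRUE)  # every row of data2; a is NA for ID 4
full  <- merge(data1, data2, by = "ID", all = TRUE)    # every ID from either data frame
```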
You can also use the dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good overview of dplyr joins: https://rpubs.com/williamsurles/293454
Hope it helps
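For the specific dropout flag asked about, a join is not strictly needed: a membership test handles both the different lengths and the different ordering. The following is only a sketch on a tiny made-up data frame; the column names id and YEAR are taken from the code in the question and may differ in the actual file:

```r
# Hypothetical stand-in for the healthcare data: one row per person and year
data <- data.frame(id   = c(1, 2, 3, 1, 3, 4),
                   YEAR = c(1984, 1984, 1984, 1985, 1985, 1985))

data1 <- subset(data, YEAR == 1984)
data2 <- subset(data, YEAR == 1985)

# 1 if a 1984 respondent has no row in 1985, 0 otherwise.
# %in% compares each id against the whole set of 1985 ids,
# so neither row order nor length has to match.
data1$dropped <- as.integer(!(data1$id %in% data2$id))
```

Here person 2 is the only one who appears in 1984 but not in 1985, so only that row is flagged.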
I have two sets of data, each in a separate data frame. This is because one is derived from an Excel spreadsheet and the other by automatic iteration of raw data files.
Both data frames have one thing in common: a first column containing a uniform timestamp information for the observations in them. df1 contains data on humidity and temperature (variables: timestamp, hum, temp), and df2 contains an oxygen, power and a time variable (variables: timestamp, O2, power, time).
Ideally, df1 should contain all timestamped observations that df2 contains. Additionally, df1 contains some extra observations that need to be removed.
I would like to "join" both data frames, such that for each timestamp, all variable values from both df are joined (i.e. variables: timestamp, hum, temp, O2, power, time). Those observations that only occur in df1 should be removed.
Is there any smart way of doing that?
Kind regards
kruemelprinz
It seems like you're just looking for a simple left join. This can be done via dplyr with
left_join(df2, df1, by = "timestamp")
which returns every row of df2 with the matching df1 columns attached, and drops all of the extra observations that occur only in df1. If a timestamp in df2 has no match in df1, hum and temp are NA for that row.
A base R implementation is:
merge(x = df2, y = df1, by = "timestamp", all.x = TRUE)
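A small sketch of the base R version on made-up data, to show which rows survive. All values are hypothetical; only the column names come from the question:

```r
# Hypothetical data: df1 has one extra timestamp (00:10) that df2 lacks
df1 <- data.frame(
  timestamp = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 00:10:00",
                           "2020-01-01 00:20:00"), tz = "UTC"),
  hum  = c(40, 41, 42),
  temp = c(20.1, 20.3, 20.2)
)
df2 <- data.frame(
  timestamp = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 00:20:00"),
                         tz = "UTC"),
  O2    = c(20.9, 20.8),
  power = c(1.2, 1.3),
  time  = c(0, 20)
)

# Keep every row of df2 and attach hum and temp where the timestamp matches
joined <- merge(x = df2, y = df1, by = "timestamp", all.x = TRUE)

# Optional: put the columns in the order asked for in the question
joined <- joined[, c("timestamp", "hum", "temp", "O2", "power", "time")]
```

Note that both timestamp columns must have the same class and time zone for the values to match.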
I'm new to R, and I'm trying to join two tables. The field shared between the two tables is the date, but when I import the data it arrives in a different format in each table.
First Table:
Second Table:
What I actually need is to join the data by operating system and remove Linux, like an inner join in SQL with a condition on the operating system. Thanks
Say that your first dataset is called df1 and the second one df2, you can join the two by calling:
merge(df1, df2, by = "operatingSystem")
You can specify the kind of join with all = TRUE, all.x = TRUE, or all.y = TRUE.
I have not reproduced your example, but I will give it a go as is.
First, in your second table, you need to convert the date column to an actual date.
You can do this easily with lubridate,
assuming df1 and df2 are the first and second table respectively:
library(lubridate)
df2$date <- ymd(df2$date) #ymd function assumes `year` then `month` then `day` when converting
Then you can use dplyr's inner_join to perform the desired join.
From stat545:
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
library(dplyr)
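inner_join(df1, df2, by = c("date", "operatingSystem"))
This keeps only the rows of df1 that have a match in df2, so Linux stays out, and it returns the columns of both tables: newusers from df1 and users from df2. If a non-key column name occurs in both tables, dplyr tells the two apart with the suffixes .x and .y.
Note: the date columns must have the same class in both tables. If df1$date is a date-time, you might need to convert it to a Date, for example with lubridate::as_date(df1$date).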
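Because the two tables are not shown in the question, the following is only a sketch on invented data. It uses base R (as.Date and merge) so that it runs without extra packages. The date format "%Y%m%d" for the second table, and all values, are assumptions:

```r
# Hypothetical first table: dates already stored as Date
df1 <- data.frame(
  date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-02")),
  operatingSystem = c("Windows", "Linux", "Windows"),
  newusers = c(5, 2, 7)
)

# Hypothetical second table: dates imported as text, and no Linux rows
df2 <- data.frame(
  date = c("20170101", "20170102"),
  operatingSystem = c("Windows", "Windows"),
  users = c(50, 70)
)

# Bring the second table's dates to the same class as the first
# (the format string is an assumption about how the text looks)
df2$date <- as.Date(df2$date, format = "%Y%m%d")

# Inner join on both keys: only rows present in both tables remain,
# so the Linux row of df1 is dropped
joined <- merge(df1, df2, by = c("date", "operatingSystem"))
```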
I have two structurally identical data frames: column id-part1, column id-part2 and column data1.
id-part1 and id-part2 are used together as an index.
Now I want to calculate the difference in column data1 between the two data frames, matched on the two id columns. It might happen that a combination of id-part1 and id-part2 exists in only one of the data frames...
So it is somehow a SQL join operation, isn't it?
The merge() function is what you are looking for.
It works similarly to an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your two data frames. merge() uses x and y to refer to them, where x is the first (DF1) and y the second (DF2).
The by argument names the columns to match on (you can even specify different names for each data frame with by.x and by.y).
all.x and all.y specify the kind of join to perform, depending on the rows you want to keep; setting both to TRUE gives a full outer join.
The result is a new data frame in which data1 appears twice, as data1.x (from DF1) and data1.y (from DF2), with NA where a combination of ids exists in only one data frame. You can then continue with your calculations.
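A sketch of the full calculation on made-up data. The column names id_part1 and id_part2 use underscores here instead of the hyphens in the question, because data.frame() would otherwise rewrite hyphenated names; all values are hypothetical:

```r
# Hypothetical data: (1, "b") exists only in DF1, (2, "b") only in DF2
DF1 <- data.frame(id_part1 = c(1, 1, 2),
                  id_part2 = c("a", "b", "a"),
                  data1    = c(10, 20, 30))
DF2 <- data.frame(id_part1 = c(1, 2, 2),
                  id_part2 = c("a", "a", "b"),
                  data1    = c(4, 5, 6))

# Full outer join on both id columns; data1 becomes data1.x and data1.y
solution <- merge(DF1, DF2, by = c("id_part1", "id_part2"), all = TRUE)

# Difference per id combination; NA where one side is missing
solution$diff <- solution$data1.x - solution$data1.y
```

Whether a missing combination should stay NA or be treated as 0 depends on the data, so that choice is left open here.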
I have a data frame with 139104 rows (96 x 1449). I also have a phenotype file that contains the phenotype information for the 96 samples. The 1449 SNP names are repeated for each of the 96 samples. I have to merge the two data frames based on sid and sen. This is what my two data frames look like:
dat <- data.frame(
snpname=rep(letters[1:12],12),
sid=rep(1:12,each=12),
genotype=rep(c('aa','ab','bb'), 12)
)
pheno <- data.frame(
sen=1:12,
disease=rep(c('N','Y'),6),
wellid=1:12
)
I have to merge or add the disease column and 3 other columns to the data file. I am unable to get merge to work in R. I have searched Google, but I am not hitting the correct terms to find the answer. I would appreciate any input on this issue.
Thanks, Sharad
You can specify the columns you want to match on directly with merge():
merge(dat, pheno, by.x = "sid", by.y = "sen")
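Run against the sample data from the question, the call can be checked like this (a sketch; the name merged is just an illustrative choice):

```r
# Sample data from the question
dat <- data.frame(
  snpname  = rep(letters[1:12], 12),
  sid      = rep(1:12, each = 12),
  genotype = rep(c("aa", "ab", "bb"), 12)
)
pheno <- data.frame(
  sen     = 1:12,
  disease = rep(c("N", "Y"), 6),
  wellid  = 1:12
)

# Match dat$sid against pheno$sen; each phenotype row is repeated
# for every SNP row of the same sample
merged <- merge(dat, pheno, by.x = "sid", by.y = "sen")

# The key column keeps the name from the first data frame (sid)
names(merged)
nrow(merged)
```

Every row of dat finds exactly one phenotype row, so the merged data frame has the same number of rows as dat, with the disease and wellid columns added.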