I have 2 dataframes: one has my data and another is like a vlookup table (called PlanMap), where I look up the LeavePlanCode in the PlanMap to get the Plan.Type.
To perform the vlookup, I have done the following code:
data <- merge(data, PlanMap[,c(1:2)], by.x = "LeavePlanCode", by.y = "From.Raw.Data", all.x = TRUE)
where From.Raw.Data is the same thing as the LeavePlanCode, just in the main dataset.
The merge works correctly for 415,000 of my rows; however, 4,000 of them come back as NA. These NAs trace back to 3 (of 150) LeavePlanCodes that are also in my PlanMap, so they should be matched just like the rest of the 150 plan codes are.
Can anyone explain why the vlookup on these 3 isn't working?
Thanks!
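One common culprit worth checking is invisible whitespace (or case) differences in the key column. A minimal sketch of how this produces NA, using made-up codes and values:

```r
# Toy reproduction (hypothetical codes): a trailing space in the lookup
# key makes merge() return NA even though the values print identically.
data    <- data.frame(LeavePlanCode = c("ABC", "XYZ"), Hours = c(8, 4))
PlanMap <- data.frame(From.Raw.Data = c("ABC", "XYZ "),   # note the space
                      Plan.Type     = c("Medical", "Dental"))

merged <- merge(data, PlanMap, by.x = "LeavePlanCode",
                by.y = "From.Raw.Data", all.x = TRUE)
merged$Plan.Type   # "Medical" NA -- "XYZ" does not equal "XYZ "

# Trimming both key columns before merging usually fixes it:
PlanMap$From.Raw.Data <- trimws(PlanMap$From.Raw.Data)
```

Running `setdiff(unique(data$LeavePlanCode), PlanMap$From.Raw.Data)` before the merge will list exactly which codes fail to match.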
Applying mvn to iris requires subsetting by the Species variable.
However, the result of this package is a nest of named lists, each ultimately containing a dataframe with p-values and metrics for one subset.
I find this awkward to read and to process, so I want to merge all of these nearly identically named tables into one.
I haven't found a merge function that takes multiple tables at once.
Therefore, I tried Reduce(), as suggested in other similar questions: it applies merge() correctly to pairs of dataframes, merging two tables at a time.
The problems are:
The class names (Species) are lost
The results get mixed together
I am forced to specify either univariate or multivariate, and to call it twice
I would just like a single resulting dataframe with one column for the classes, one for univariate/multivariate, and the rest merged by column names.
mvn_results <- MVN::mvn(iris, subset = "Species", mvnTest = "hz")
mvn_results
Resulting in:
$multivariateNormality
$multivariateNormality$setosa
$multivariateNormality$versicolor
$multivariateNormality$virginica
$univariateNormality
$univariateNormality$setosa
$univariateNormality$versicolor
$univariateNormality$virginica
$Descriptives
$Descriptives$setosa
$Descriptives$versicolor
$Descriptives$virginica
And tables with repeated structure like this:
I tried this:
mvn_merged <- Reduce(function(x, y) merge(x, y, all = TRUE),
                     mvn_results$univariateNormality)
mvn_merged
which produced a merged table, but with the problems described above: the Species labels are lost and the rows get mixed.
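A pattern that keeps the class labels is to tag each table with its list name before combining. Here it is sketched on toy stand-ins for mvn_results$univariateNormality (the numbers are made up for illustration):

```r
# Toy stand-in for the nested list returned by MVN::mvn() (made-up values):
univ <- list(
  setosa     = data.frame(Variable = c("Sepal.Length", "Sepal.Width"),
                          p.value  = c(0.46, 0.27)),
  versicolor = data.frame(Variable = c("Sepal.Length", "Sepal.Width"),
                          p.value  = c(0.26, 0.34)),
  virginica  = data.frame(Variable = c("Sepal.Length", "Sepal.Width"),
                          p.value  = c(0.23, 0.18))
)

# Prepend each table's list name as a Species column, then row-bind:
tagged <- Map(function(df, cls) cbind(Species = cls, df),
              univ, names(univ))
merged <- do.call(rbind, c(tagged, make.row.names = FALSE))
merged   # one dataframe: Species, Variable, p.value
```

The same pattern applied to both $univariateNormality and $multivariateNormality, with a Type column ("univariate"/"multivariate") added the same way before a final rbind, gives the single dataframe described above; `dplyr::bind_rows(univ, .id = "Species")` is a one-line equivalent.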
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and they aren't in the same order either (e.g. respondents 3-7 do not respond in 1984 but do appear in 1985).
Assuming data1 and data2 are two dataframes (unclear, because it appears you extracted them from a single larger dataframe called data), I think it is better to merge them and work with a single dataframe. That is: if there is a single larger dataframe, do not subset it, just delete the columns you do not need; if data1 and data2 are separate dataframes, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = "columnID")  # where columnID is the name of the ID variable; if it differs between data1 and data2, use by.x and by.y
Then you have to decide which rows to keep, with the parameters all.x, all.y, and all: all rows from data1 even if no match is found in data2 (all.x = TRUE), all rows from data2 even if no match is found in data1 (all.y = TRUE), or all rows regardless of whether there is a matching ID in the other table (all = TRUE).
merge() is in the base package and available with any installation of R.
You can also use the dplyr package, which makes the type of join more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good link for dplyr join https://rpubs.com/williamsurles/293454
Hope it helps
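For the dropout flag itself, a vectorised membership test avoids the lapply/ifelse recycling problem entirely. A minimal sketch, with toy subsets standing in for the 1984 and 1985 data:

```r
# Toy stand-ins for data1 (1984) and data2 (1985); ids are made up.
data1 <- data.frame(id = c(1, 2, 8, 9))           # observed in 1984
data2 <- data.frame(id = c(2, 3, 4, 5, 6, 7, 9))  # observed in 1985

# 1 if the id from 1984 does NOT reappear in 1985, else 0.
# %in% matches by value, so differing lengths and ordering are fine.
data1$didtheydrop <- as.integer(!(data1$id %in% data2$id))
data1   # didtheydrop is 1 for ids 1 and 8, 0 for ids 2 and 9
```

This appends the flag directly to the 1984 data, ready to use as the outcome variable for the model.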
I'm usually a SAS user but was wondering if there was a similar way in R to list data that can only be found in one data frame after merging them. In SAS I would have used
data want;
merge have1 (In=in1) have2 (IN=in2) ;
if not in2;
run;
to find the entries only in have1.
My R code is:
inner <- merge(have1, have2, by= "Date", all.x = TRUE, sort = TRUE)
I've tried setdiff() and dplyr's anti_join(), but neither seems to give me what I want. Additionally, I would like to do the converse of this: find the entries in have1 and have2 that have the same "Date" entry and then keep the remaining variables from the 2 data frames. For example, consider have1 with columns "Date", "ShotHeight", "ShotDistance" and have2 with columns "Date", "ThrowHeight", "ThrowDistance", so that the new dataframe, call it "new", has columns "Date", "ShotHeight", "ShotDistance", "ThrowHeight", "ThrowDistance".
Assuming only one by-variable, the simplest solution is not to merge at all:
want <- subset(have1, !(Date %in% have2$Date))
This subsets have1 to exclude rows whose Date value appears in have2.
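The converse (rows whose Date appears in both tables, keeping the remaining variables from each) is an inner join, which is base merge()'s default behaviour. A sketch using the column names from the question, with made-up values:

```r
# Toy versions of have1 and have2 (values are made up):
have1 <- data.frame(Date = c("2020-01-01", "2020-01-02", "2020-01-03"),
                    ShotHeight = c(2.1, 2.3, 2.2),
                    ShotDistance = c(10, 12, 11))
have2 <- data.frame(Date = c("2020-01-02", "2020-01-03", "2020-01-04"),
                    ThrowHeight = c(1.8, 1.9, 1.7),
                    ThrowDistance = c(30, 31, 29))

# Rows only in have1 -- the SAS "if not in2" case:
only1 <- subset(have1, !(Date %in% have2$Date))

# Inner join: only Dates present in both, all five columns kept:
new <- merge(have1, have2, by = "Date")
```

Without all.x/all.y, merge() keeps only the Dates common to both tables, which matches the requested "new" dataframe.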
I'm trying to merge two datasets in R. The 1st dataset is called AcademicData and the other one is called MathsData. When I merge the datasets, I'm getting thousands of duplicate rows. Here's a pic of the code and of the resulting merged table, called total. I'm trying to merge the datasets by the variable "gender".
Here's the code.
setwd("H:/Data application/x14484252-DAD Project")
MathsData <- read.csv("Math-Students.csv", header=T, na.strings=c(""),
stringsAsFactors = T)
AcademicData <- read.csv("Academic-Performance.csv", header=T,
na.strings=c(""), stringsAsFactors = T)
total <- merge(MathsData, AcademicData, by="gender", all.x=TRUE)
As you can see from the image, the merge creates 93,435 rows in the table called total.
Here's an image of the 1st dataset in Excel.
Academic Dataset
Here's an image of the second dataset in Excel.
MathsData
I want to merge the two datasets by gender, without duplicate rows being created in the table called total.
You could do this:
library(data.table)
setDT(MathsData); setDT(AcademicData)
MathsData[AcademicData, mult = "first", on = "gender", nomatch=0L]
Since you did not provide reproducible data, I couldn't test the code, but I think it should work well.
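For context on why the row count explodes: gender is not a unique key, so merge() pairs every matching row in one table with every matching row in the other (a many-to-many join). A toy illustration in base R:

```r
# Two tables that share a non-unique key (made-up data):
a <- data.frame(gender = c("F", "F", "M"), x = 1:3)
b <- data.frame(gender = c("F", "F", "M"), y = 4:6)

# Each "F" in a pairs with each "F" in b: 2*2 + 1*1 = 5 rows.
nrow(merge(a, b, by = "gender"))   # 5

# Keeping only the first row per gender in one table bounds the result
# by the other table's row count (similar to data.table's mult = "first"):
b_first <- b[!duplicated(b$gender), ]
nrow(merge(a, b_first, by = "gender"))   # 3
```

With only two gender values shared by hundreds of rows on each side, that product is how 93,435 rows can arise from two modest tables; merging on a key that uniquely identifies students would avoid it entirely.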
I have data in a dataframe with 139,104 rows, which is 96 x 1449. I have a phenotype file that contains the phenotype information for the 96 samples; each snp name is repeated across the 96 samples (1449 x 96). I have to merge the two dataframes based on sid and sen. This is what my two dataframes look like:
dat <- data.frame(
snpname=rep(letters[1:12],12),
sid=rep(1:12,each=12),
genotype=rep(c('aa','ab','bb'), 12)
)
pheno <- data.frame(
sen=1:12,
disease=rep(c('N','Y'),6),
wellid=1:12
)
I have to merge, or add, the disease column and 3 other columns onto the data file. I have been unable to use merge in R, and my Google searches haven't hit the right terms to find the answer. I would appreciate any input on this issue.
Thanks, Sharad
You can specify the columns you want to match on directly with merge():
merge(dat, pheno, by.x = "sid", by.y = "sen")
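Applied to the example data from the question, that call attaches disease and wellid (plus the matching key) to every genotype row:

```r
# Example data as given in the question:
dat <- data.frame(
  snpname  = rep(letters[1:12], 12),
  sid      = rep(1:12, each = 12),
  genotype = rep(c("aa", "ab", "bb"), 12)
)
pheno <- data.frame(
  sen     = 1:12,
  disease = rep(c("N", "Y"), 6),
  wellid  = 1:12
)

merged <- merge(dat, pheno, by.x = "sid", by.y = "sen")
dim(merged)   # 144 rows, 5 columns: sid, snpname, genotype, disease, wellid
```

Each of the 12 sid values matches exactly one sen row in pheno, so the merge keeps all 144 genotype rows and simply copies the phenotype columns across.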