I have data in a data frame with 139104 rows, which is 96 x 1449. I have a phenotype file that contains the phenotype information for the 96 samples. Each of the 1449 SNP names is repeated for the 96 samples. I have to merge the two data frames based on sid and sen. This is what my two data frames look like:
dat <- data.frame(
snpname=rep(letters[1:12],12),
sid=rep(1:12,each=12),
genotype=rep(c('aa','ab','bb'), 12)
)
pheno <- data.frame(
sen=1:12,
disease=rep(c('N','Y'),6),
wellid=1:12
)
I have to add the disease column and 3 other columns to the data file. I have been unable to get merge to work in R. I have searched Google, but I am not hitting the correct terms to find the answer. I would appreciate any input on this issue.
Thanks, Sharad
You can specify the columns you want to match on directly with merge():
merge(dat, pheno, by.x = "sid", by.y = "sen")
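As a quick check of what that call returns, here is a minimal sketch using the toy data frames from the question (genotype is repeated 48 times here so that every column has 144 rows explicitly). The key column keeps the name from the first data frame (sid), and every column of pheno is appended to each matching row.

```r
# Toy data from the question: 12 SNPs for each of 12 samples.
dat <- data.frame(
  snpname  = rep(letters[1:12], 12),
  sid      = rep(1:12, each = 12),
  genotype = rep(c("aa", "ab", "bb"), 48)
)
pheno <- data.frame(
  sen     = 1:12,
  disease = rep(c("N", "Y"), 6),
  wellid  = 1:12
)

# Match dat$sid against pheno$sen; the key is named "sid" in the result.
merged <- merge(dat, pheno, by.x = "sid", by.y = "sen")

dim(merged)    # one row per row of dat, plus the phenotype columns
names(merged)
```

Because every sid has exactly one matching sen, the result has the same number of rows as dat. Note that merge() sorts the result by the key by default.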
Related
I'm currently working in R on a survey of schools, and I would like to add a variable with the population of the city each school is in.
In the first data set I have all the survey respondents, which includes a variable "city_name". I have managed to find a list of cities with their populations online, which I have imported into R.
What I would now like to do is add a variable to data_set_1 called Pop_city, equal to the city population whenever city_name appears in both data sets. It might be relevant to know that the first data set has around 1200 rows while the second one has around 36000 rows.
I've tried several things including the following:
data_set_1$Pop_city = ifelse(data_set_1$city_name == data_set_2$city_name, data_set_2$Pop_city, 0)
Any clues?
Thanks!!
You need to merge the two datasets:
new_df <- merge(data_set_1, data_set_2, by="city_name")
The result will be a data frame containing only the matching rows and all columns of both data frames. In your case that is 1200 rows, assuming that every city in data_set_1 is also in data_set_2 and that each city appears only once in data_set_2. If you also want to keep the non-matching rows of data_set_1, you can use the all.x option:
new_df <- merge(data_set_1, data_set_2, by="city_name", all.x=TRUE)
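To make the difference between the two calls concrete, here is a small sketch with made-up data. The city names, the school column, and the population figures are all hypothetical; only the column names city_name and Pop_city come from the question.

```r
# Hypothetical survey rows; "Gamma" has no entry in the population table.
data_set_1 <- data.frame(
  school    = c("S1", "S2", "S3"),
  city_name = c("Alpha", "Beta", "Gamma")
)
# Hypothetical population table; "Delta" has no school in the survey.
data_set_2 <- data.frame(
  city_name = c("Beta", "Alpha", "Delta"),
  Pop_city  = c(2000, 1000, 4000)
)

# Default merge: only cities present in both data frames survive.
inner <- merge(data_set_1, data_set_2, by = "city_name")

# all.x = TRUE: every row of data_set_1 is kept; unmatched cities get NA.
left <- merge(data_set_1, data_set_2, by = "city_name", all.x = TRUE)

nrow(inner)
nrow(left)
left
```

With all.x = TRUE the unmatched city gets NA rather than 0 in Pop_city, which is usually the safer marker for "population unknown".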
Two ways you could try using dplyr:
library(dplyr)
data_set_1 %>%
  # match() returns the position of each city_name in data_set_2 (NA if absent)
  mutate(Pop_city = ifelse(city_name %in% data_set_2$city_name,
                           data_set_2$Pop_city[match(city_name, data_set_2$city_name)],
                           0))
or using a left_join
data_set_1 %>%
left_join(data_set_2, by = "city_name")
perhaps followed by a select(all_of(names(data_set_1)), Pop_city) to keep only the original columns plus the population.
I have 2 dataframes: one has my data and another is like a vlookup table (called PlanMap), where I look up the LeavePlanCode in the PlanMap to get the Plan.Type.
To perform the vlookup, I have done the following code:
data <- merge(data, PlanMap[,c(1:2)], by.x = "LeavePlanCode", by.y = "From.Raw.Data", all.x = TRUE)
where From.Raw.Data is the same thing as the LeavePlanCode, just in the main dataset.
The merge works correctly for 415,000 of my rows, however 4,000 of them are showing up as NA. These NAs narrow down to 3 (of 150) LeavePlanCode which are also in my PlanMap, so they should be pulled in, just like the rest of the 150 plan codes are.
Can anyone explain why the vlookup on these 3 isn't working??
Thanks!
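One frequent cause of this symptom, although nothing in the question confirms it, is that the keys look identical but differ in trailing whitespace or letter case. The sketch below uses hypothetical plan codes (P1, P2, P3) and plan types to show how such mismatches could be listed and then normalised before merging; clean() is a helper introduced here purely for illustration.

```r
# Hypothetical data: "P2 " has a trailing space and "p3" differs in case.
data <- data.frame(LeavePlanCode = c("P1", "P2 ", "p3"),
                   stringsAsFactors = FALSE)
PlanMap <- data.frame(From.Raw.Data = c("P1", "P2", "P3"),
                      Plan.Type     = c("TypeA", "TypeB", "TypeC"),
                      stringsAsFactors = FALSE)

# Codes in the data that have no exact match in the lookup table.
unmatched <- setdiff(unique(data$LeavePlanCode), PlanMap$From.Raw.Data)
print(unmatched)

# Illustrative helper: strip surrounding whitespace and ignore case.
# Whether ignoring case is appropriate depends on the real codes.
clean <- function(x) toupper(trimws(as.character(x)))

data$key    <- clean(data$LeavePlanCode)
PlanMap$key <- clean(PlanMap$From.Raw.Data)

fixed <- merge(data, PlanMap[, c("key", "Plan.Type")],
               by = "key", all.x = TRUE)
```

Printing the unmatched codes with dput(unmatched) can also reveal invisible characters that a plain print hides.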
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two data frames (this is unclear, because it appears that you extracted them from a single larger data frame called data), I think it is better to merge them and work with a single data frame. That is, if there is a single larger data frame, do not subset it; just delete the columns you do not need. If data1 and data2 are two separate data frames, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
# "columnID" is a placeholder for the name of the ID variable; if the name differs between data1 and data2, use by.x and by.y
merge(data1, data2, by = "columnID")
Then you have to decide which rows to keep, using the parameters all.x, all.y, and all. all.x = TRUE keeps every row of data1 even if no match is found in data2, all.y = TRUE keeps every row of data2 even if no match is found in data1, and all = TRUE keeps every row of both regardless of whether there is a matching ID in the other data frame.
merge() is part of base R, so it is available in any installation.
You can also use the dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good link for dplyr join https://rpubs.com/williamsurles/293454
Hope it helps
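Applied to the dropout question, the merge approach might look like the sketch below. The ids are made up, and the column names in_1985 and dropped are introduced here for illustration; the idea is to left-join a flag built from the 1985 ids onto the 1984 rows, so that a missing flag means the respondent did not reappear.

```r
# Made-up ids: respondents 1 and 8 appear in 1984 only.
data1 <- data.frame(id = c(1, 2, 8, 9), YEAR = 1984)
data2 <- data.frame(id = c(2, 3, 9),    YEAR = 1985)

# One row per id seen in 1985, carrying a marker column.
flag <- data.frame(id = unique(data2$id), in_1985 = TRUE)

# Keep every 1984 row; in_1985 is NA where the id is absent from 1985.
merged <- merge(data1, flag, by = "id", all.x = TRUE)

# 1 = dropped out after 1984, 0 = still present in 1985.
merged$dropped <- as.integer(is.na(merged$in_1985))
merged
```

Because the comparison is done by key rather than by position, it does not matter that the two years have different lengths or that the ids appear in a different order.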
I have a major dataframe and 70 smaller dataframes. I would like to merge each of the 70 dataframes in the list with the main dataframe (so I end up with one huge dataframe).
The small dataframes all have 2 columns, but all have a different number of rows. I would like to do two things:
1) Change the name of the second column (by adding a suffix to the name of column 1)
2) Then merge that small data frame into the main data frame.
Here is what I have in mind:
colnames(SmallDf1)[2] <- paste0(colnames(SmallDf1)[1], "_1")  # paste0 avoids the space that paste() would insert before "_1"
main_Df <- merge(main_Df, SmallDf1, by.x = "NAME", by.y = names(SmallDf1)[1], all.x = TRUE)
This is obviously just for a single data frame. Does anyone have ideas for doing this for all 70 data frames? Any help appreciated!
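One way this could be generalised, assuming the small data frames are first collected into a list (called small_dfs here, a hypothetical name), is to loop over the list, rename the second column using the list position as the suffix, and merge each one into the main data frame in turn. The data below is made up for illustration.

```r
# Made-up main data frame and two small two-column data frames.
main_Df <- data.frame(NAME = c("a", "b", "c"))
small_dfs <- list(
  data.frame(key = c("a", "b"),      value = c(1, 2)),
  data.frame(key = c("b", "c", "d"), value = c(10, 20, 30))
)

for (i in seq_along(small_dfs)) {
  small <- small_dfs[[i]]
  # Rename column 2 to "<name of column 1>_<i>", e.g. "key_1".
  names(small)[2] <- paste0(names(small)[1], "_", i)
  # Left join: keep every row of main_Df, add the renamed column.
  main_Df <- merge(main_Df, small,
                   by.x = "NAME", by.y = names(small)[1],
                   all.x = TRUE)
}

main_Df
```

If a small data frame contains the same key more than once, the corresponding rows of main_Df will be duplicated by the merge, so it may be worth checking for duplicates first.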
I have recently started using R again, and I'm trying to compare two Excel tables (let's call them table 1 and table 2) with very different data. The only thing they have in common is one column containing the gene ID.
My goal is to find and keep all the rows of table 1 in which the gene ID exactly matches a gene ID in table 2.
For example, table1 contains 10 columns and its col1 contains the gene ID. Table2 contains only 5 columns and its col2 contains the gene ID. I want to compare those two columns and get a data.frame containing the whole rows of table1 that have a match.
I hope I'm clear? English is not my first language ^^
Thanks a lot !
merge(x = table1,
y = table2,
by.x = "column_name_table1",
by.y = "column_name_table2",
all.x = TRUE)
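Note that all.x = TRUE keeps every row of table1, including those without a match, and the merge also brings in the columns of table2. If the goal is only the rows of table1 whose gene ID also occurs in table2, with no extra columns, a plain %in% filter may be closer to what was asked. This is a different technique from the merge above, sketched with made-up gene IDs and columns:

```r
# Made-up tables; col1 and col2 hold the gene IDs as in the question.
table1 <- data.frame(col1 = c("g1", "g2", "g3", "g4"),
                     expr = c(1.5, 2.0, 0.3, 4.1))
table2 <- data.frame(col2 = c("g2", "g4", "g9"),
                     note = c("x", "y", "z"))

# Keep whole rows of table1 whose gene ID appears anywhere in table2.
kept <- table1[table1$col1 %in% table2$col2, ]
kept
```

Unlike a merge, this never duplicates rows of table1 when a gene ID occurs several times in table2.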