Match rownames and colnames of two dataframes - r

I have two dataframes. The row names of one of them (metadata) are the same as the column names of the second (TPM).
I reordered the rownames of metadata in a certain way, and now I want to reorder the colnames of TPM accordingly, so that both are in the same order.
First, I make sure that the rownames in metadata really are the same as the colnames in TPM; any row or column that is not common to both is deleted:
INDEX = intersect(colnames(TPM),rownames(metadata))
metadata = metadata[INDEX,]
TPM = TPM[,INDEX]
Then I reorder the metadata rows (according to a feature in the metadata; which one doesn't really matter):
metadata = metadata %>% arrange(Benefit)
And now comes the problem: I want the order of the TPM colnames to be the same as the order of the metadata rownames. To change the TPM colnames order, I tried this:
index = order(colnames(TPM),rownames(metadata))
TPM = TPM[,index]
but TPM didn't change at all. I also tried this code with match and intersect, and TPM still didn't change. I don't want to change metadata's rownames; they are already in the order I want.
Why aren't TPM's colnames changing?
Sample of TPM:
structure(list(Pt1 = c(12.1089467388298, 0, 18.0385362276576,
1.92790576596844, 94.2409551672849, 16.8882703013677, 23.1934795882213
), Pt10 = c(5.31468107381049, 0, 19.5665754959778, 6.08224068432115,
188.508358461147, 8.48446380342082, 9.79919096042849), Pt8 = c(9.07821549589067,
0, 14.5403817716173, 9.291006716028, 101.817341286849, 12.4642830001982,
23.666833737613), Pt101 = c(6.62551158552007, 0, 44.4975176830514,
5.6097199560158, 29.3442077622205, 4.47109939839761, 22.5645567722583
), Pt103 = c(18.1736419550473, 0, 19.8694720219646, 0.385296099264051,
10.7494896282499, 5.30680045140835, 6.25389746004431), Pt106 = c(7.28900047884896,
0.25821427874803, 23.1149486669295, 80.936069669191, 117.783929643365,
10.167975612355, 23.7821668347939), Pt11 = c(2.81580497213944,
0, 12.8578712768363, 1.8638846278409, 66.1843554522191, 12.0730665529163,
18.8979998503006)), row.names = c("A1BG", "NAT2", "ADA", "CDH2",
"AKT3", "ZBTB11-AS1", "MED6"), class = "data.frame")
Sample of metadata:
structure(list(Cohort = c("NIV3-PROG", "NIV3-NAIVE", "NIV3-NAIVE",
"NIV3-PROG", "NIV3-PROG", "NIV3-NAIVE", "NIV3-PROG"), Response = c("PD",
"SD", "PD", "PD", "PD", "PD", "SD"), `Dead/Alive
(Dead = True)` = c(TRUE,
TRUE, TRUE, FALSE, TRUE, TRUE, TRUE), `Time to Death
(weeks)` = c(22.85714286,
36.57142857, 37, 69.14285714, 13, 119.5714286, 61.28571429),
Subtype = c("CUTANEOUS", "CUTANEOUS", "MUCOSAL", "CUTANEOUS",
"MUCOSAL", "CUTANEOUS", "OCULAR/UVEAL"), `Mutational
Subtype` = c("NA",
"NF1", "TripleWt", "TripleWt", "BRAF", "BRAF", "TripleWt"
), `M Stage` = c("M1C", "M1A", "M0", "M1B", "M1C", "NA",
"M1C"), `Mutation Load` = c("NA", "75", "87", "21", "700",
"106", "33"), `Neo-antigen Load` = c("NA", "33", "44", "5",
"219", "67", "14"), `Neo-peptide Load` = c("NA", "56", "69",
"11", "273", "187", "25"), `Cytolytic Score` = c("977.86911190000001",
"65.840716889999996", "457.43633440000002", "1108.8620289999999",
"645.54163300000005", "602.6740413", "228.61321050000001"
), Benefit = c("NoResponse", "NoResponse", "NoResponse",
"NoResponse", "NoResponse", "NoResponse", "NoResponse")), row.names = c("Pt1",
"Pt10", "Pt8", "Pt103", "Pt106", "Pt11", "Pt82"), class = "data.frame")

I think you can just sort the INDEX vector:
INDEX=sort(intersect(colnames(TPM),rownames(metadata)))
metadata=metadata[INDEX,]
TPM=TPM[,INDEX]
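If the goal is for TPM to follow whatever order metadata ends up in (rather than alphabetical order), one option is to index the TPM columns by name with rownames(metadata). The sketch below uses hypothetical toy values and does the reordering with base R simply to keep the example dependency-free. Note that order() called with two vectors sorts by the first and only uses the second to break ties, so it does not align one set of names to another.

```r
# Toy data (hypothetical values), laid out like the question's objects
TPM <- data.frame(Pt1 = c(1, 2), Pt10 = c(3, 4), Pt8 = c(5, 6),
                  row.names = c("A1BG", "NAT2"))
metadata <- data.frame(Benefit = c("NoResponse", "Response", "NoResponse"),
                       row.names = c("Pt8", "Pt1", "Pt10"),
                       stringsAsFactors = FALSE)

# Keep only the samples present in both objects
common <- intersect(colnames(TPM), rownames(metadata))
metadata <- metadata[common, , drop = FALSE]
TPM <- TPM[, common, drop = FALSE]

# Reorder metadata by the feature of interest (base R keeps the row names)
metadata <- metadata[order(metadata$Benefit), , drop = FALSE]

# Select TPM columns by name, in the order of metadata's row names
TPM <- TPM[, rownames(metadata), drop = FALSE]

# Both objects now list the samples in the same order
identical(colnames(TPM), rownames(metadata))
```

Indexing by name avoids computing a permutation at all, which is why no call to order() or match() is needed for the TPM step.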

Related

How to generate heatmap for heterogenous data type ( table contains numerics, logical, character , NA and empty cells) using R? (Not solved yet)

I want to create a heatmap from a data frame that contains heterogeneous data (numeric values, logicals, characters, NA and empty cells).
Here is an example dataset that matches the actual dataset I have.
I want to plot "citizen" on y axis and plot all other variables (column) on x-axis.
structure(list(ID = c("ID123", "ID456", "ID523", "ID875", "ID782",
"ID572", "ID900"), Citizen = c("US", "CN", "MX", "US", "US",
"CA", "CA"), Ht = c("6", "NA", "5", "6", "5", NA, "6"), Wt = c("200",
"140", "160", NA, "NA", "175", NA), Age = c("NA", "45", NA, "32",
"60", "44", "30"), income = c("60", "50", "30", "20", "40", "NA",
"20"), sex = c("M", "F", "NA", NA, "M", "M", "F"), `Traffic vio` = c(TRUE,
FALSE, TRUE, FALSE, NA, TRUE, TRUE), Greets = c("Hello", "Bonjour",
"Hola", "Hi", "Hello", "Hello", "Bonjour")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
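The first thing you need to do is to replace the character strings "NA" with the NA constant. Applying na_if() column by column to the character columns also works with recent dplyr versions, where na_if() no longer accepts a whole data frame.
library(dplyr)
df <- df %>%
mutate(across(where(is.character), ~ na_if(.x, "NA")))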
Next you need your numeric data to not be stored as character.
df <- df %>%
mutate(across(Ht:income, as.numeric))
You might want your character columns to be factors, especially Citizen, sex and Greets.
df <- df %>%
mutate(across(where(is.character), factor))
You may want to decide what to do with your NA in Traffic vio - is this more likely a TRUE or FALSE? Leave it be if you want.
df <- df %>%
mutate(`Traffic vio` = if_else(is.na(`Traffic vio`), FALSE, `Traffic vio`))
You can now make a heatmap using geom_tile from ggplot2. If you want to plot summary statistics, like mean, you should probably aggregate your data ahead of time.
library(ggplot2)
df %>%
group_by(Citizen, sex) %>%
summarize(Age = mean(Age, na.rm = TRUE)) %>%
ggplot() +
geom_tile(aes(x = sex, y = Citizen, fill = Age))
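To put several variables on the x axis against Citizen, as the question asks, the data usually needs to be in long format first: one row per (Citizen, variable) pair. The following is a minimal base-R sketch of that step on a cut-down copy of the example data; the column selection and the choice of the mean as the summary are assumptions, not part of the original answer.

```r
# Cut-down copy of the example data (two of the numeric-like columns only)
df <- data.frame(
  Citizen = c("US", "CN", "MX", "US", "US", "CA", "CA"),
  Ht = c("6", "NA", "5", "6", "5", NA, "6"),
  Wt = c("200", "140", "160", NA, "NA", "175", NA),
  stringsAsFactors = FALSE
)
num_cols <- c("Ht", "Wt")

# Turn the literal string "NA" into a real missing value, then convert to numeric
df[num_cols] <- lapply(df[num_cols], function(col) {
  col[col %in% "NA"] <- NA
  as.numeric(col)
})

# Mean of each numeric variable per Citizen (NaN where a group has no data)
means <- aggregate(df[num_cols], by = list(Citizen = df$Citizen),
                   FUN = mean, na.rm = TRUE)

# Long format: one row per (Citizen, variable) pair
long <- data.frame(
  Citizen  = rep(means$Citizen, times = length(num_cols)),
  variable = rep(num_cols, each = nrow(means)),
  value    = unlist(means[num_cols], use.names = FALSE)
)
long

# With ggplot2 loaded, this could then be drawn as:
# ggplot(long) + geom_tile(aes(x = variable, y = Citizen, fill = value))
```

Because the variables are on different scales, a single fill scale will be dominated by the largest one; rescaling each variable before plotting may be worth considering.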

error Predictor.new() function package IML in R

I am attempting to use package 'iml' in R to create plots of SHAP values from a GBM model created in H2O.
When I try to create the R6 Predictor object using the Predictor.new() function I get an error that states Error : all(feature.class %in% names(feature.types)) is not TRUE.
From this I am guessing that there is something about one of the feature classes that is incorrect, but this is just an educated guess based upon what the error message is literally saying.
Here is a sample of anonymized data (I can't share the real data because it is confidential):
structure(list(dlr_id_cur = c(1, 2), date_eff = structure(c(16014,
15416), class = "Date"), new_vec_ind = structure(c(1L, 1L), .Label = c("NNA",
"UNA"), class = "factor"), cntrct_term = c(9587879614862828,
19), amt_financed = c(9455359, 65561175), reg_payment = c(885288,
389371), acct_stat_cd = structure(c(3L, 3L), .Label = c("11",
"22", "33"), class = "factor"), base_rental = c(1, 626266), down_pymt = c(2,
6654661), car_count = c(5, 1), dur_lease = c(3974, 6466), returned = structure(1:2, .Label = c("00",
"11"), class = "factor"), state = structure(c(10L, 1L), .Label = c("ANA",
"BNA", "CNA", "DNA", "FNA", "GNA", "HNA", "INA", "KNA", "LNA",
"MNA", "NNA", "ONA", "PNA", "QNA", "RNA", "SNA", "TNA", "UNA",
"VNA", "WNA"), class = "factor"), zip = c(34633, 45222), zip_two_digits = structure(c(71L,
36L), .Label = c("00", "01", "02", "03", "04", "05", "06", "07",
"08", "09", "110", "111", "112", "113", "114", "115", "116",
"117", "118", "119", "220", "221", "222", "223", "224", "225",
"226", "227", "228", "229", "330", "331", "332", "333", "334",
"335", "336", "337", "338", "339", "440", "441", "442", "443",
"444", "445", "446", "447", "448", "449", "550", "551", "552",
"553", "554", "555", "556", "557", "558", "559", "660", "661",
"662", "663", "664", "665", "666", "667", "668", "669", "770",
"771", "772", "773", "774", "775", "776", "777", "778", "779",
"880", "881", "882", "883", "884", "885", "886", "887", "888",
"889", "990", "991", "992", "993", "994", "995", "996", "997",
"998", "999", "ANA", "BNA", "CNA", "ENA", "GNA", "HNA", "JNA",
"KNA", "LNA", "MNA", "NNA", "PNA", "RNA", "SNA", "TNA", "VNA"
), class = "factor")
, mod_year_date = c(8156, 6278), vehic_mod_fam_code = structure(c(2L,
2L), .Label = c("BNA", "CNA", "ENA", "MNA", "SNA", "TNA", "VNA",
"XNA"), class = "factor"), mod_class_code = structure(c(4L, 2L
), .Label = c("BNA", "CNA", "ENA", "GNA", "MNA", "RNA", "SNA"
), class = "factor"), count_dl_DL_CDE_CSPS_A_NP = c(945, 337),
DL_CDE_CSPS_A_NP_avg_dl = c(3355188283749626, 8835582388327814
), count_sv_DL_CDE_CSPS_A_NP = c(6532, 8475), DL_CDE_CSPS_A_NP_avg_sv = c(4471193398278526,
6934672627789796), count_dl_NUM_CSPS_INIT_SCR = c(774, 773
), NUM_CSPS_INIT_SCR_avg_dl = c(9468453388562312, 5847816458727333
), count_sv_NUM_CSPS_INIT_SCR = c(2467, 3882), NUM_CSPS_INIT_SCR_avg_sv = c(5857936629789154,
8963457353776469), count_FFV = c(8563, 2566), average_FFV = c(25697792913881564,
13693335921646120), csps_NUM_SV = c(8, 6), avg_SV_rating = c(9817541424596360,
6218928542331853), csps_FFV_ratio = c(23125612473476952,
2), avg_DL_rating = c(2182256921592387, 7668957586431513),
has_DL_rating = c(1, 8), has_bad_DL_rating = c(2, 4), serv_has_MNT = c(7,
3), serv_has_SCP = c(5, 4), serv_has_ELW = c(9, 4), serv_has_LCP = c(7,
1), ro_count = c(6, 1), ro_tot_cust_pay = c(2, 188759), ro_tot_pay = c(3,
764372), date_eff_weekday = structure(c(4L, 3L), .Label = c("FNA",
"MNA", "SNA", "TNA", "WNA"), class = "factor"), date_eff_month_int = c(83,
7), date_eff_day = c(2, 24)), .Names = c("dlr_id_cur", "date_eff",
"new_vec_ind", "cntrct_term", "amt_financed", "reg_payment",
"acct_stat_cd", "base_rental", "down_pymt", "car_count", "dur_lease",
"returned", "state", "zip", "zip_two_digits", "mod_year_date",
"vehic_mod_fam_code", "mod_class_code", "count_dl_DL_CDE_CSPS_A_NP",
"DL_CDE_CSPS_A_NP_avg_dl", "count_sv_DL_CDE_CSPS_A_NP", "DL_CDE_CSPS_A_NP_avg_sv",
"count_dl_NUM_CSPS_INIT_SCR", "NUM_CSPS_INIT_SCR_avg_dl", "count_sv_NUM_CSPS_INIT_SCR",
"NUM_CSPS_INIT_SCR_avg_sv", "count_FFV", "average_FFV", "csps_NUM_SV",
"avg_SV_rating", "csps_FFV_ratio", "avg_DL_rating", "has_DL_rating",
"has_bad_DL_rating", "serv_has_MNT", "serv_has_SCP", "serv_has_ELW",
"serv_has_LCP", "ro_count", "ro_tot_cust_pay", "ro_tot_pay",
"date_eff_weekday", "date_eff_month_int", "date_eff_day"), row.names = 1:2, class = "data.frame")
# 1. create a data frame with just the features
features_iml <- as.data.frame(df_testR) %>% dplyr::select(-returned)
# 2. Create a vector with the actual responses
response_iml <- as.numeric(as.vector(df_testR$returned))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of customer churn in my example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# 4. example of prediction output
pred(GBM5, features_iml) %>% head()
# 5. create Predictor object
predictor = Predictor$new(model = GBM5, data = features_iml, y =
response_iml, predict.fun = pred, class = "classification")
Error : all(feature.class %in% names(feature.types)) is not TRUE
Here are also some basic descriptions of the dataset and model object I'm using in the code above:
class(GBM5)
[1] "H2OBinomialModel"
attr(,"package")
[1] "h2o"
class(df_testR)
[1] "tbl_df" "tbl" "data.frame"
dim(df_testR)
[1] 47006 44
If there is anything else I can provide or if I have been unclear please let me know.
The iml package accepts only specific feature classes, namely numeric, integer, character, factor and ordered. If your data contains Date objects, or any data type other than the five listed here, then the Predictor object cannot be created.
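A quick way to find the offending columns is to compare each column's class against that list. The sketch below does this in base R on a toy data frame (hypothetical values; date_eff mirrors the Date column in the sample data) and then converts the Date column to a plain number. Whether the H2O model still predicts sensibly on the converted column is a separate question; the custom predict function may need to convert it back.

```r
# Toy features (hypothetical values); date_eff is a Date, as in the question
features <- data.frame(
  dlr_id_cur = c(1, 2),
  date_eff   = structure(c(16014, 15416), class = "Date"),
  state      = factor(c("LNA", "ANA"))
)

# Classes the answer lists as acceptable to iml
allowed <- c("numeric", "integer", "character", "factor", "ordered")

# Names of the columns whose class is not in the allowed set
is_ok <- vapply(features, function(col) any(class(col) %in% allowed), logical(1))
bad_cols <- names(features)[!is_ok]
bad_cols

# One possible fix, assuming the flagged columns are all Dates:
# store them as the number of days since 1970-01-01
features[bad_cols] <- lapply(features[bad_cols], as.numeric)
vapply(features, function(col) class(col)[1], character(1))
```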

Convert column types to their read_csv() column type in R

One of my favorite things about library(readr) and the read_csv() function in R is that it almost always sets the column types of my data to the correct class. However, I am currently working with an API in R that returns data to me as a dataframe of all character classes, even if the data is clearly numbers. Take this dataframe for example, which has some sports data:
dput(mydf)
structure(list(isUnplayed = c("false", "false", "false"), isInProgress =
c("false", "false", "false"), isCompleted = c("true", "true", "true"), awayScore = c("106",
"95", "95"), homeScore = c("94", "97", "111"), game.ID = c("31176",
"31177", "31178"), game.date = c("2015-10-27", "2015-10-27",
"2015-10-27"), game.time = c("8:00PM", "8:00PM", "10:30PM"),
game.location = c("Philips Arena", "United Center", "Oracle Arena"
), game.awayTeam.ID = c("88", "86", "110"), game.awayTeam.City = c("Detroit",
"Cleveland", "New Orleans"), game.awayTeam.Name = c("Pistons",
"Cavaliers", "Pelicans"), game.awayTeam.Abbreviation = c("DET",
"CLE", "NOP"), game.homeTeam.ID = c("91", "89", "101"), game.homeTeam.City = c("Atlanta",
"Chicago", "Golden State"), game.homeTeam.Name = c("Hawks",
"Bulls", "Warriors"), game.homeTeam.Abbreviation = c("ATL",
"CHI", "GSW"), quarterSummary.quarter = list(structure(list(
`#number` = c("1", "2", "3", "4"), awayScore = c("25",
"23", "34", "24"), homeScore = c("25", "18", "23", "28"
)), .Names = c("#number", "awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("17",
"23", "28", "27"), homeScore = c("26", "20", "25", "26")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("35",
"14", "26", "20"), homeScore = c("39", "20", "35", "17")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)))), .Names = c("isUnplayed", "isInProgress", "isCompleted",
"awayScore", "homeScore", "game.ID", "game.date", "game.time",
"game.location", "game.awayTeam.ID", "game.awayTeam.City", "game.awayTeam.Name",
"game.awayTeam.Abbreviation", "game.homeTeam.ID", "game.homeTeam.City",
"game.homeTeam.Name", "game.homeTeam.Abbreviation", "quarterSummary.quarter"
), class = "data.frame", row.names = c(NA, 3L))
It is quite a hassle to deal with this dataframe once it is returned by the API, given the class types. I've come up with a sort of a hack to update the column classes, which is as follows:
write_csv(mydf, 'mydf.csv')
mydf <- read_csv('mydf.csv')
By writing to CSV and then re-reading the CSV using read_csv(), the dataframe columns update. Unfortunately I am left with a CSV file in my directory that I don't want. Is there a way to update the columns of an R dataframe to their 'read_csv()' column classes, without actually having to write the CSV?
Any help is appreciated!
You don't need to write and read the data if you just want readr to guess your column types. You could use readr::type_convert for that:
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) %>%
readr::type_convert() %>%
str()
For comparison:
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) %>%
str()
Try this code. type.convert() converts a character vector to logical, integer, numeric or complex as appropriate; with as.is = TRUE, columns that do not look like any of those stay character instead of becoming factors, so no conversion back is needed.
indx <- which(sapply(df, is.character))
df[indx] <- lapply(df[indx], type.convert, as.is = TRUE)
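As a small illustration of the base-R route, here is a sketch on three columns taken from the question's sample data. It passes as.is = TRUE to type.convert(), which is an assumption about the desired result: text columns then stay character rather than becoming factors.

```r
# Three columns from the question's sample data, all stored as character
api_df <- data.frame(
  isCompleted   = c("true", "true", "true"),
  awayScore     = c("106", "95", "95"),
  game.location = c("Philips Arena", "United Center", "Oracle Arena"),
  stringsAsFactors = FALSE
)

# Convert every character column to the most specific type that fits
is_chr <- vapply(api_df, is.character, logical(1))
api_df[is_chr] <- lapply(api_df[is_chr], type.convert, as.is = TRUE)

str(api_df)
```

Nothing is written to disk, so there is no leftover CSV file to clean up.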

Descriptives for a specified subset of rows in r

I have a long format dataset with each row being another measurement (as indicated by my "timeline.compressed" variable, which has 8 possible values; see dput below).
However, now I want to check the descriptive statistics of some of my variables (i.e., x1-x3), but for each of the timepoints separately. I've tried using the if function, but that gives me the warning that the condition has length > 1.
Does anyone know what code I should use to get summary statistics for each of the timepoints separately?
dput for table with possible timeline values:
structure(c(7518L, 6178L, 6393L, 5886L, 6121L, 5977L, 7440L,
5886L), .Dim = 8L, .Dimnames = structure(list(c("5", "16", "28",
"40", "52", "64", "79", "95")), .Names = ""), class = "table")
dput for example dataset
structure(list(nomem_encr = c(800009L, 800009L, 800012L, 800015L,
800015L, 800015L), timeline.compressed = c(79, 95, 79, 28, 40,
52), sel = c(4.9, NA, NA, 6.9, 6.7, NA), close_num = c(1, 0.2,
1, 0.8, 1, 0.8), gener_sat = c(7, 7, 8, 7, 7, 5)), .Names = c("ID",
"timeline.compressed", "x1", "x2", "x3"), row.names = c(NA,
6L), class = "data.frame")
Using dplyr you can do the following, with timeline_values being your frequency table and df your data:
library(dplyr)
data.frame(timeline.compressed = as.numeric(names(timeline_values))) %>%
left_join(df, by = "timeline.compressed") %>%
group_by(timeline.compressed) %>%
summarize_all(mean, na.rm = TRUE)
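For comparison, a base-R sketch of the same idea using aggregate() on the example dataset is shown below. The if statement fails here because it expects a single TRUE or FALSE, whereas grouping functions apply the summary once per timepoint. The choice of mean as the statistic is just an example.

```r
# Example dataset from the question
df <- data.frame(
  ID = c(800009L, 800009L, 800012L, 800015L, 800015L, 800015L),
  timeline.compressed = c(79, 95, 79, 28, 40, 52),
  x1 = c(4.9, NA, NA, 6.9, 6.7, NA),
  x2 = c(1, 0.2, 1, 0.8, 1, 0.8),
  x3 = c(7, 7, 8, 7, 7, 5)
)

# Mean of x1-x3 for each timepoint; NaN marks timepoints with no observed values
by_time <- aggregate(
  df[c("x1", "x2", "x3")],
  by = list(timeline.compressed = df$timeline.compressed),
  FUN = mean, na.rm = TRUE
)
by_time
```

Swapping mean for sd, median or another function gives the other descriptives; only timepoints that occur in df appear in the result.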

nested data.frame [closed]

I have a nested data.frame
dput(res)
structure(list(date = structure(list(pretty = "12:00 PM CDT on August 14, 2015",
year = "2015", mon = "08", mday = "14", hour = "12", min = "00",
tzname = "America/Chicago"), .Names = c("pretty", "year",
"mon", "mday", "hour", "min", "tzname"), class = "data.frame", row.names = 1L),
fog = "0", rain = "1", snow = "0", snowfallm = "0.00", snowfalli = "0.00",
monthtodatesnowfallm = "", monthtodatesnowfalli = "", since1julsnowfallm = "",
since1julsnowfalli = "", snowdepthm = "", snowdepthi = "",
hail = "0", thunder = "0", tornado = "0", meantempm = "26",
meantempi = "79", meandewptm = "17", meandewpti = "63", meanpressurem = "1019",
meanpressurei = "30.09", meanwindspdm = "11", meanwindspdi = "7",
meanwdire = "", meanwdird = "139", meanvism = "16", meanvisi = "10",
humidity = "", maxtempm = "32", maxtempi = "90", mintempm = "21",
mintempi = "69", maxhumidity = "86", minhumidity = "36",
maxdewptm = "18", maxdewpti = "65", mindewptm = "15", mindewpti = "59",
maxpressurem = "1021", maxpressurei = "30.15", minpressurem = "1017",
minpressurei = "30.04", maxwspdm = "19", maxwspdi = "12",
minwspdm = "0", minwspdi = "0", maxvism = "16", maxvisi = "10",
minvism = "16", minvisi = "10", gdegreedays = "29", heatingdegreedays = "0",
coolingdegreedays = "14", precipm = "0.00", precipi = "0.00",
precipsource = "", heatingdegreedaysnormal = "", monthtodateheatingdegreedays = "",
monthtodateheatingdegreedaysnormal = "", since1sepheatingdegreedays = "",
since1sepheatingdegreedaysnormal = "", since1julheatingdegreedays = "",
since1julheatingdegreedaysnormal = "", coolingdegreedaysnormal = "",
monthtodatecoolingdegreedays = "", monthtodatecoolingdegreedaysnormal = "",
since1sepcoolingdegreedays = "", since1sepcoolingdegreedaysnormal = "",
since1jancoolingdegreedays = "", since1jancoolingdegreedaysnormal = ""), .Names = c("date",
"fog", "rain", "snow", "snowfallm", "snowfalli", "monthtodatesnowfallm",
"monthtodatesnowfalli", "since1julsnowfallm", "since1julsnowfalli",
"snowdepthm", "snowdepthi", "hail", "thunder", "tornado", "meantempm",
"meantempi", "meandewptm", "meandewpti", "meanpressurem", "meanpressurei",
"meanwindspdm", "meanwindspdi", "meanwdire", "meanwdird", "meanvism",
"meanvisi", "humidity", "maxtempm", "maxtempi", "mintempm", "mintempi",
"maxhumidity", "minhumidity", "maxdewptm", "maxdewpti", "mindewptm",
"mindewpti", "maxpressurem", "maxpressurei", "minpressurem",
"minpressurei", "maxwspdm", "maxwspdi", "minwspdm", "minwspdi",
"maxvism", "maxvisi", "minvism", "minvisi", "gdegreedays", "heatingdegreedays",
"coolingdegreedays", "precipm", "precipi", "precipsource", "heatingdegreedaysnormal",
"monthtodateheatingdegreedays", "monthtodateheatingdegreedaysnormal",
"since1sepheatingdegreedays", "since1sepheatingdegreedaysnormal",
"since1julheatingdegreedays", "since1julheatingdegreedaysnormal",
"coolingdegreedaysnormal", "monthtodatecoolingdegreedays", "monthtodatecoolingdegreedaysnormal",
"since1sepcoolingdegreedays", "since1sepcoolingdegreedaysnormal",
"since1jancoolingdegreedays", "since1jancoolingdegreedaysnormal"
), class = "data.frame", row.names = 1L)
and I am using the following command to retrieve data from it
df <- data.frame()
df <- rbind(df, ldply(res, function(x) x[[1]]))
To use this data frame, I convert it into a data.table with dt <- data.table(df), and then I know how to work with the data, for instance dt[.id=="fog"].
Is there a more elegant/efficient solution?
The problem was solved by @antoine-sac. It was not necessary to use apply to get the data; it was only a question of "un-nesting" the data.
Your problem is that your data is a data.frame and one of its columns is date, but date is itself a data.frame. As you say, it is a nested list. So let's "un-nest" it.
You can simply do (assuming your data is in data):
df.date <- data$date
# removing the incorrectly formatted date column from data
data$date <- NULL
At this point, data is a normal data.frame and df.date is also a basic data.frame.
> df.date
                           pretty year mon mday hour min          tzname
1 12:00 PM CDT on August 14, 2015 2015  08   14   12  00 America/Chicago
If you want to merge that with your existing data.frame:
# binding df.date with your data
data <- cbind(data, df.date)
No need for any kind of apply.
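The whole un-nesting step can be tried on a small stand-in for the data. The sketch below builds a hypothetical nested data.frame with only a few of the original columns and then flattens it the way the answer describes.

```r
# Small stand-in for the nested data: `date` is itself a one-row data.frame
data <- data.frame(fog = "0", rain = "1", meantempm = "26",
                   stringsAsFactors = FALSE)
data$date <- data.frame(year = "2015", mon = "08", mday = "14",
                        stringsAsFactors = FALSE)

# Pull the nested data.frame out, then drop it from the parent
df.date <- data$date
data$date <- NULL

# Bind the two flat data.frames side by side
data <- cbind(data, df.date)
str(data)
```

After this, every column is an ordinary character vector, so the usual data.frame tools apply directly.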
Now, if you don't know how to access variables in a data.frame, that's another matter.
If you want, say, meantempm, you can simply do data$meantempm.
For more, I refer you to a beginner tutorial on R; there are plenty to choose from with a quick search.
