How to remove duplicated strings and merge all column strings into one? - r

I have data that looks like the following df:
df<- structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("DNAJC11;FGOTG",
"MAPK14", "PPIB", "RBX1", "USP14"), class = "factor"), V2 = structure(c(4L,
3L, 2L, 1L, 1L), .Label = c("", "DNAJC9", "MAPK14", "USP14"), class = "factor"),
V3 = structure(c(3L, 2L, 4L, 5L, 1L), .Label = c("", "DNAJC11;FGOTG",
"GCLC", "GSR", "STIP1"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
I want to merge all columns into one and then keep only the unique values.
For example, the output should look like this:
USP14
DNAJC11;FGOTG
MAPK14
PPIB
RBX1
DNAJC9
GCLC
GSR
STIP1
I tried to use the melt function but could not figure out how to do this. Any comment is appreciated. Thanks

unique(as.vector(as.matrix(df)))
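To remove the empty strings:
vec <- unique(as.vector(as.matrix(df)))
# Logical indexing is safe even when there are no empty strings;
# vec[-which(vec == "")] would return an empty vector in that case
vec[vec != ""]
or, courtesy of @rawr: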
Filter(nzchar, unique(as.vector(as.matrix(df))))
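As a minimal, self-contained sketch of the same idea, here is a smaller made-up data frame (the values are borrowed from the question, but the layout is an assumption for illustration). It flattens the columns, drops empty strings with nzchar, and keeps the unique values. It also shows why negative indexing with which() can be fragile when nothing matches.

```r
# Toy data frame with factor columns, loosely modelled on the question's df
df <- data.frame(
  V1 = c("USP14", "DNAJC11;FGOTG", "MAPK14"),
  V2 = c("USP14", "MAPK14", ""),
  V3 = c("GCLC", "DNAJC11;FGOTG", ""),
  stringsAsFactors = TRUE
)

# as.matrix() turns the factor columns into a character matrix;
# as.vector() then flattens it column by column
all_values <- as.vector(as.matrix(df))

# Drop empty strings first, then keep the first occurrence of each value
unique_values <- unique(all_values[nzchar(all_values)])
print(unique_values)

# Pitfall: with no empty strings, which() returns integer(0),
# and indexing with -integer(0) selects nothing at all
no_blanks <- c("a", "b")
dropped <- no_blanks[-which(no_blanks == "")]
print(length(dropped))
```

Logical indexing (all_values[nzchar(all_values)]) avoids that pitfall because it never depends on at least one match being present.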

Related

Issues with melt() in R not matching names?

I am trying to convert the following df from wide to long
Input:
structure(list(activity_level = structure(1:4, .Label = c("Sedentary",
"Lightly Active", "Moderately Active", "Very Active"), class = "factor"),
poor_sleepers = c(0.254032258064516, 0.258695652173913, 0.333333333333333,
0.253119429590018), normal_sleepers = c(0.332661290322581,
0.360869565217391, 0.318181818181818, 0.42602495543672),
excess_sleepers = c(0.413306451612903, 0.380434782608696,
0.348484848484849, 0.320855614973262)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
Using the following line and the melt function:
daily_sleep_byActivity_long <- melt(daily_sleep_byActivity, id.vars = "activity_level")
So that I get a result like this:
structure(list(user_type = structure(c(1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L), .Label = c("Sedentary", "Lightly Active",
"Fairly Active", "Very Active"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("bad_sleepers",
"normal_sleepers", "over_sleepers"), class = "factor"), value = c(0.32076169347384,
0.0601633003867641, 0.133333333333333, 0.175548589341693, 0.594379737474579,
0.778685002148689, 0.866666666666667, 0.824451410658307, 0.0848585690515807,
0.161151697464547, 0, 0)), row.names = c(NA, -12L), class = "data.frame")
As far as I can tell, my syntax is correct and the input has the right format, but my output tells me "names do not match previous names". Does anyone know if I have missed something? My research online hasn't addressed this question.
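The post does not establish what causes the error, so the following does not diagnose it. It is only a minimal base R sketch of the wide-to-long reshape itself, without melt, assuming a plain data frame with an id column named activity_level. The values are shortened and rounded from the question, and the object names daily_sleep and long are made up for illustration.

```r
# Shortened, rounded stand-in for the question's wide data
daily_sleep <- data.frame(
  activity_level  = factor(c("Sedentary", "Lightly Active"),
                           levels = c("Sedentary", "Lightly Active")),
  poor_sleepers   = c(0.25, 0.26),
  normal_sleepers = c(0.33, 0.36),
  excess_sleepers = c(0.41, 0.38)
)

# Every column except the id column is treated as a measured variable
value_cols <- setdiff(names(daily_sleep), "activity_level")

# Repeat the id once per measured column, label each block with its
# column name, and concatenate the measured columns one after another
long <- data.frame(
  activity_level = rep(daily_sleep$activity_level, times = length(value_cols)),
  variable = factor(rep(value_cols, each = nrow(daily_sleep)), levels = value_cols),
  value = unlist(daily_sleep[value_cols], use.names = FALSE)
)
print(long)
```

The result has one row per activity level and variable, which is the same layout as the desired output shown in the question.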

Conditionally adding characters to new column based on separate dataset

Hello all and thank you in advance.
I would like to add a new column to my pre-existing data frame, where the values are sourced from a second data frame based on certain conditions. The dataset I wish to add the new column to ("data_melt") has many different sample IDs (sample.#) under the variable column. Using a second dataset ("metadata"), I want to add the pond names to a new column in "data_melt" based on the sample IDs. The sample IDs are the same in both datasets.
My gut tells me there's an obvious solution but my head is pretty fried. Here is a toy example of my data_melt df (since the real one has 25,000 observations):
> dput(toy)
structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
And here is a toy example of my metadata df:
> dput(toy)
structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
Thank you again!
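We can use match from base R to create a numeric index into the metadata and use it to look up the pond values (here toy is the data_melt example and out is the metadata data frame):
toy$pond <- with(toy, out$pond[match(variable, out$sample)])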
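To make the match approach concrete, here is a small self-contained sketch. The data frames are cut-down versions of the question's toy data, and the names data_melt, metadata, and idx are chosen for illustration.

```r
# Cut-down version of the long data from the question
data_melt <- data.frame(
  gene = c("serA", "mdh", "fdhB", "fdhA"),
  variable = factor(c("sample.10", "sample.19", "sample.72", "sample.72")),
  value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)
)

# Metadata with one row per sample ID
metadata <- data.frame(
  sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
  pond = factor(c("upper", "upper", "lower", "lower")),
  stringsAsFactors = FALSE
)

# For each row of data_melt, find the row of metadata with the same sample ID
# (match() compares factors by their labels, not their integer codes)
idx <- match(data_melt$variable, metadata$sample)

# Use that index to pull the pond for every row
data_melt$pond <- metadata$pond[idx]
print(data_melt)
```

Sample IDs that are missing from the metadata get NA in the new column, and the row order of data_melt is left untouched.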
I believe merge will work here.
sss <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
ss <- structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
ssss <- merge(sss, ss, by.x = "variable", by.y = "sample")
You can use left_join() from the dplyr package after renaming sample to variable in the metadata data frame.
library(dplyr)
data_melt <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"),
                            process = structure(c(1L, 1L, 1L, 1L),
                                                .Label = "energy",
                                                class = "factor"),
                            category = structure(c(1L, 1L, 1L, 1L),
                                                 .Label = "metabolism",
                                                 class = "factor"),
                            ko = structure(1:4,
                                           .Label = c("K00058", "K00093", "K00125", "K00148"),
                                           class = "factor"),
                            variable = structure(c(1L, 2L, 3L, 3L),
                                                 .Label = c("sample.10", "sample.19", "sample.72"),
                                                 class = "factor"),
                            value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)),
                       row.names = c(NA, -4L),
                       class = "data.frame")
metadata <- structure(list(sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
                           pond = structure(c(2L, 2L, 1L, 1L),
                                            .Label = c("lower", "upper"),
                                            class = "factor")),
                      row.names = c(NA, -4L),
                      class = "data.frame") %>%
  # Renaming the column, so we can join the two data sets together
  rename(variable = sample)
# Keep every row of data_melt and add the matching pond from metadata
data_melt <- data_melt %>%
  left_join(metadata, by = "variable")

Finding best (or worst) synergizing champions in LoL pro games using R

I have some data that looks like this: https://i.imgur.com/hzEd7bT.png
These will be pro league of legends matches once they occur over the course of the next few months. I filled out a few as examples.
Rows 6-10 are champions that each team banned. Rows 11-15 are champions that each team picked.
Each week has about 10 games and there are 9 weeks.
The B and R at the top are Blue (side) and Red (side) in the game. Blue side always gets first choice of champion and red side always gets last choice.
I want to find the best (or worst) synergizing champions.
To clarify what I mean by this: in my screenshot, the team with Brand and Yuumi won both times, while the team with Aurelion Sol and Azir lost both times.
Ideally, I want to know how many times each combination of 2, 3, 4, or 5 champions was picked together, and the corresponding win rate.
Edit: I am not sure exactly how the data needs to look in R because I have never done this before, but I made two different versions of inputting it below.
LoLGames <- matrix(c('W','Annie','Ezreal','Yuumi','Camille','Brand',
'L','Nasus', 'Aurelion Sol', 'Azir', 'Blitzcrank', 'Caitlyn',
'L','Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille',
'W','Bard', 'Ashe', 'Yuumi', 'Kogmaw', 'Brand'),
ncol = 6, byrow = TRUE)
colnames(LoLGames) <- c("Result","Champ1","Champ2","Champ3","Champ4","Champ5")
rownames(LoLGames) <- c("Game1","Game2","Game3","Game4")
LoLGames <- as.table(LoLGames)
*Corresponding dput
structure(c("W", "L", "L", "W", "Annie", "Nasus", "Nasus", "Bard",
"Ezreal", "Aurelion Sol", "Aurelion Sol", "Ashe", "Yuumi", "Azir",
"Blitzcrank", "Yuumi", "Camille", "Blitzcrank", "Ezreal", "Kogmaw",
"Brand", "Caitlyn", "Camille", "Brand"), .Dim = c(4L, 6L), .Dimnames = list(
c("Game1", "Game2", "Game3", "Game4"), c("Result", "Champ1",
"Champ2", "Champ3", "Champ4", "Champ5")), class = "table")
Result <- c('W','L','L','W',NA)
G1W <- c('Annie','Ezreal','Yuumi','Camille','Brand')
G1L <- c('Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille')
G2L <- c('Nasus', 'Aurelion Sol', 'Blitzcrank', 'Ezreal', 'Camille')
G2W <- c('Bard', 'Ashe', 'Yuumi', 'Kogmaw', 'Brand')
LoLDf <- data.frame(Result, G1W, G1L, G2L, G2W)
*Corresponding dput
structure(list(Result = structure(c(2L, 1L, 1L, 2L, NA), .Label = c("L",
"W"), class = "factor"), G1W = structure(c(1L, 4L, 5L, 3L, 2L
), .Label = c("Annie", "Brand", "Camille", "Ezreal", "Yuumi"), class = "factor"),
G1L = structure(c(5L, 1L, 2L, 4L, 3L), .Label = c("Aurelion Sol",
"Blitzcrank", "Camille", "Ezreal", "Nasus"), class = "factor"),
G2L = structure(c(5L, 1L, 2L, 4L, 3L), .Label = c("Aurelion Sol",
"Blitzcrank", "Camille", "Ezreal", "Nasus"), class = "factor"),
G2W = structure(c(2L, 1L, 5L, 4L, 3L), .Label = c("Ashe",
"Bard", "Brand", "Kogmaw", "Yuumi"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
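One possible way to approach this in base R is sketched below. It assumes the first layout from the question (one row per team per game, a Result column, and five champion columns). The helper combo_winrate is hypothetical and written only for this illustration. It lists every k-champion combination on each team with combn, then counts how often each combination appears and how often it won.

```r
# Toy data mirroring the matrix version in the question
games <- data.frame(
  Result = c("W", "L", "L", "W"),
  Champ1 = c("Annie", "Nasus", "Nasus", "Bard"),
  Champ2 = c("Ezreal", "Aurelion Sol", "Aurelion Sol", "Ashe"),
  Champ3 = c("Yuumi", "Azir", "Blitzcrank", "Yuumi"),
  Champ4 = c("Camille", "Blitzcrank", "Ezreal", "Kogmaw"),
  Champ5 = c("Brand", "Caitlyn", "Camille", "Brand"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick counts and win rates for all k-champion combos
combo_winrate <- function(games, k) {
  champ_cols <- setdiff(names(games), "Result")
  per_team <- lapply(seq_len(nrow(games)), function(i) {
    # Sort so the same set of champions always yields the same label
    champs <- sort(unlist(games[i, champ_cols], use.names = FALSE))
    data.frame(
      combo = combn(champs, k, FUN = paste, collapse = " + "),
      win = games$Result[i] == "W",
      stringsAsFactors = FALSE
    )
  })
  all_combos <- do.call(rbind, per_team)
  picks <- tapply(all_combos$win, all_combos$combo, length)
  winrate <- tapply(all_combos$win, all_combos$combo, mean)
  out <- data.frame(
    combo = names(picks),
    picks = as.vector(picks),
    winrate = as.vector(winrate),
    stringsAsFactors = FALSE
  )
  # Most frequently picked combinations first
  out[order(-out$picks, out$combo), ]
}

pairs <- combo_winrate(games, 2)
print(head(pairs))
```

Calling combo_winrate(games, 3), and so on up to 5, gives the larger combinations. With only a few dozen games, most combinations will appear once or twice, so the win rates should be read with that sample size in mind.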

Convert all variables into ordered factors

I am using the semTools package to carry out EFA using categorical data. The efaUnrotate() function requires variables as ordered factors.
I am trying to convert all of my variables, which are already factors, into ordered factors using some simple code, but unfortunately it does not seem to work. Does anyone have an explanation for this?
My data:
test <- structure(list(fp_weightloss = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), fp_gripstrength = structure(c(1L,
2L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
fp_walktime = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), fp_metmins = structure(c(2L, 1L,
1L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
My code:
test_ord <- as.data.frame(sapply(test, as.ordered))
sapply(test_ord, class)
Results in no change:
fp_weightloss fp_gripstrength fp_walktime fp_metmins
"factor" "factor" "factor" "factor"
When I would expect:
class(as.ordered(test$fp_weightloss))
[1] "ordered" "factor"
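The problem is sapply: it simplifies its result into a character matrix, which discards the ordered factor class before as.data.frame ever sees the columns. Best avoid it entirely, since its implicit conversions often invisibly mess with data, and they do here. Use lapply instead: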
test_ord <- as.data.frame(lapply(test, as.ordered))
In general I prefer using vapply since it handles non-list return values, but getting vapply to work with S3 classes doesn’t seem possible.
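The difference can be checked directly. The sketch below uses a small made-up data frame (column names a and b are placeholders) to show what sapply actually returns and how lapply preserves the class.

```r
# Two unordered factor columns, similar in spirit to the question's data
test <- data.frame(
  a = factor(c("0", "1", "0")),
  b = factor(c("1", "1", "0"))
)

# sapply() simplifies the list of ordered factors into a character matrix,
# so the factor classes are gone before any data frame is rebuilt
simplified <- sapply(test, as.ordered)
print(class(simplified))
print(typeof(simplified))

# lapply() returns a list and keeps each element's class intact
test_ord <- as.data.frame(lapply(test, as.ordered))
print(sapply(test_ord, is.ordered))

# Assigning into test[] converts in place and keeps the data frame's own class
test[] <- lapply(test, as.ordered)
print(class(test$a))
```

Rebuilding a data frame from the character matrix gives factor columns in older R versions and character columns from R 4.0.0 on, but never ordered factors. The test[] form may be handy when the original object is a tibble, since it modifies the columns without rebuilding the object.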

change the names for certain columns in a data frame [duplicate]

I want to change the column names from the second column to the last one. Why does my command not work?
fredTable <- structure(list(Symbol = structure(c(3L, 1L, 4L, 2L, 5L), .Label = c("CASACBM027SBOG",
"FRPACBW027SBOG", "TLAACBM027SBOG", "TOTBKCR", "USNIM"), class = "factor"),
Name = structure(1:5, .Label = c("bankAssets", "bankCash",
"bankCredWk", "bankFFRRPWk", "bankIntMargQtr"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Banks", class = "factor"),
Country = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "USA", class = "factor"),
Lead = structure(c(1L, 1L, 3L, 3L, 2L), .Label = c("Monthly",
"Quarterly", "Weekly"), class = "factor"), Freq = structure(c(2L,
1L, 3L, 3L, 4L), .Label = c("1947-01-01", "1973-01-01", "1973-01-03",
"1984-01-01"), class = "factor"), Start = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Current", class = "factor"), End = c(TRUE,
TRUE, TRUE, TRUE, FALSE), SeasAdj = c(FALSE, FALSE, FALSE,
FALSE, TRUE), Percent = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Fed", class = "factor"),
Source = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Res", class = "factor"),
Series = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Level",
"Ratio"), class = "factor")), .Names = c("Symbol", "Name",
"Category", "Country", "Lead", "Freq", "Start", "End", "SeasAdj",
"Percent", "Source", "Series"), row.names = c("1", "2", "3",
"4", "5"), class = "data.frame")
To rename the columns from the second one to the last, I use the following commands, but they do not work:
names(fredTable[,-1]) = paste("case", 1:ncol(fredTable[,-1]), sep = "")
or
names(fredTable)[,-1] = paste("case", 1:ncol(fredTable)[,-1], sep = "")
In general, how can one change the names of specific columns (for example, 2 to the end, or 2 to 7) and set them to any names one likes?
Replace specific column names by subsetting on the outside of the function, not within the names function as in your first attempt:
names(fredTable)[-1] <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
Explanation
If we save the new names in a vector newnames we can investigate what is going on under the hood with replacement functions.
#These are the names that will replace the old names
newnames <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
We should always replace specific column names with the format:
#The right way to replace the second name only
names(df)[2] <- "newvalue"
#The wrong way
names(df[2]) <- "newvalue"
The problem is that the wrong form renames a temporary subset of the data frame rather than the data frame itself. The correct form modifies the vector of column names and assigns it back to the data frame in a single step.
The right way [Internal]
We can expand the function call with:
#We enter this:
names(fredTable)[-1] <- newnames
#This is carried out on the inside
`names<-`(fredTable, `[<-`(names(fredTable), -1, newnames))
The wrong way [Internal]
The internals of replacement the wrong way are like this:
#Wrong way
names(fredTable[-1]) <- newnames
#Wrong way Internal
fredTable <- `[<-`(fredTable, -1, value = `names<-`(fredTable[-1], newnames))
Notice that `names<-` is applied only to a temporary copy of the subset fredTable[-1]. That renamed copy is then written back with `[<-`, which replaces the contents of the existing columns but keeps the column names fredTable already has, so the names never change.
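A short runnable check of both forms, using a small made-up data frame in place of fredTable (the object names wrong, right, and some are placeholders for this sketch). It also covers renaming a specific range such as columns 2 to 3, which the question asks about.

```r
# Small stand-in for fredTable
df <- data.frame(Symbol = 1:2, Name = 3:4, Category = 5:6, Country = 7:8)

# Wrong way: runs without error but leaves the names untouched
wrong <- df
names(wrong[-1]) <- paste0("case", 1:3)
print(names(wrong))

# Right way: subset the names vector, not the data frame
right <- df
names(right)[-1] <- paste0("case", seq_len(ncol(right) - 1))
print(names(right))

# The same pattern renames any specific range of columns, e.g. 2 to 3
some <- df
names(some)[2:3] <- c("first", "second")
print(names(some))
```

The replacement vector must have the same length as the set of selected names, which is why the count is derived from ncol() rather than typed by hand.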
