Conditionally adding characters to new column based on separate dataset - r

Hello all and thank you in advance.
I would like to add a new column to my pre-existing data frame, where the values are sourced from a second data frame based on certain conditions. The dataset I wish to add the new column to ("data_melt") has many different sample IDs (sample.#) in the variable column. Using a second dataset ("metadata"), I want to fill the new column in "data_melt" with the pond names based on the sample IDs. The sample IDs are the same in both datasets.
My gut tells me there's an obvious solution, but my head is pretty fried. Here is a toy example of my data_melt df (since it's 25,000 observations):
> dput(toy)
structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
And here is a toy example of my metadata df:
> dput(toy)
structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
Thank you again!

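We can use match from base R to build a numeric index into the metadata and pull the pond values with it (here toy holds the data_melt example and out holds the metadata example):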
toy$pond <- with(toy, out$pond[match(variable, out$sample)])
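
To see what match is doing, here is a minimal sketch that rebuilds cut-down versions of the two toy data frames under the names used in the question (data_melt and metadata). Entering the columns as plain character vectors instead of factors is an assumption made for brevity.

```r
# Cut-down copies of the question's toy data (character columns, not factors)
data_melt <- data.frame(
  gene     = c("serA", "mdh", "fdhB", "fdhA"),
  variable = c("sample.10", "sample.19", "sample.72", "sample.72"),
  value    = c(0.00116, 2.77e-05, 1.84e-05, 0.0125),
  stringsAsFactors = FALSE
)
metadata <- data.frame(
  sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
  pond   = c("upper", "upper", "lower", "lower"),
  stringsAsFactors = FALSE
)

# For each ID in data_melt$variable, match() gives its row position in metadata$sample
idx <- match(data_melt$variable, metadata$sample)

# Use those positions to look up the pond for every row of data_melt
data_melt$pond <- metadata$pond[idx]
data_melt
```

Any sample ID that is missing from metadata gets NA in idx, and therefore NA in the new pond column, so unmatched rows are kept rather than dropped.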

I believe merge will work here.
sss <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
ss <- structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
ssss <- merge(sss, ss, by.x = "variable", by.y = "sample")
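
One thing to watch with merge: by default it keeps only rows whose key appears in both data frames. The sketch below uses two small made-up data frames (left and right are hypothetical names, and "sample.99" is an invented ID) to show how all.x = TRUE keeps every row of the first data frame.

```r
# Hypothetical inputs: "sample.99" has no entry in the lookup table
left  <- data.frame(variable = c("sample.10", "sample.99"), value = c(1, 2))
right <- data.frame(sample = c("sample.10", "sample.19"), pond = c("upper", "upper"))

# Default: inner join, the unmatched row is dropped
inner <- merge(left, right, by.x = "variable", by.y = "sample")

# all.x = TRUE: keep every row of 'left', filling pond with NA where there is no match
keep_all <- merge(left, right, by.x = "variable", by.y = "sample", all.x = TRUE)

nrow(inner)
nrow(keep_all)
```

Since the question says the sample IDs are the same in both datasets, the default may be fine here; all.x = TRUE is simply the safer choice if an ID could be missing.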

You can use left_join() from the dplyr package after renaming sample to variable in the metadata data frame.
library(tidyverse)
data_melt <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"),
process = structure(c(1L, 1L, 1L, 1L),
.Label = "energy",
class = "factor"),
category = structure(c(1L, 1L, 1L, 1L),
.Label = "metabolism",
class = "factor"),
ko = structure(1:4,
.Label = c("K00058", "K00093", "K00125", "K00148"),
class = "factor"),
variable = structure(c(1L, 2L, 3L, 3L),
.Label = c("sample.10", "sample.19", "sample.72"),
class = "factor"),
value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)),
row.names = c(NA, -4L),
class = "data.frame")
metadata <- structure(list(sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
pond = structure(c(2L, 2L, 1L, 1L),
.Label = c("lower", "upper"),
class = "factor")),
row.names = c(NA, -4L),
class = "data.frame") %>%
# Renaming the column, so we can join the two data sets together
rename(variable = sample)
data_melt <- data_melt %>%
left_join(metadata, by = "variable")

Related

Interpretation of error in sem.coef

I am trying to run a sem with a random effect in piecewiseSEM. My model runs with no error, and sem.fit() also runs with no error or warnings. However, when I run sem.coefs() I get the following warning:
1: In if (grepl("cbind", deparse(formula(x)))) all.vars(formula(x))[-c(1:2)] else all.vars(formula(x)) :
the condition has length > 1 and only the first element will be used
Any ideas what this warning is about or what it means? Given that it's a warning and not an error, the code still runs and gives me estimates, but can I trust the estimates?
Thanks!
EDIT
#code:
library(piecewiseSEM)
library(nlme)
avg.forb<-list( lme(nitrogen_variation~nat+impervious+precip.variation,random=~1|site/species,control = lmeControl(opt = "optim"),forb), lme(po4_variation~nat+impervious+precip.variaton,random=~1|site/species,control = lmeControl(opt = "optim"),forb),
lme(nitrogen~nat +impervious+precip.variation,random=~1|site/species,control = lmeControl(opt = "optim"), forb),
lme(po4 ~nat +impervious+precip.variation,random=~1|site/species,control = lmeControl(opt = "optim"),forb), lme(avg.height~nat+impervious+po4+po4_variation+nitrogen+nitrogen_variation+precip.variation+n_i, random=~1|site/species,control =lmeControl(opt="optim"),forb), lme(avg.culms~nat+impervious+po4+po4_variation+nitrogen+nitrogen_variation+precip.variation+n_i,random=~1|site/species,control = lmeControl(opt = "optim"), forb), lme(avg.chloro~nat+impervious+po4+po4_variation+nitrogen+nitrogen_variation+precip.variation+n_i,random=~1|site/species, control =lmeControl(opt="optim"),forb), lme(avg.sla~nat+impervious+po4+po4_variation+nitrogen+nitrogen_variation+precip.variation+n_i,random=~1|site/species, control = lmeControl(opt = "optim"),forb))
sem.fit(avg.forb, conditional=T, forb) #this code gives the above error message
#data subset:
structure(list(site = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("Baker", "Cronkelton", "Delaware"), class = "factor"), species = structure(c(1L, 4L, 6L, 2L, 3L, 5L), .Label = c("apocynum cannabinum", "aster ericoides", "aster lanceolatus var. interior", "cirsium arvense", "impatiens capensis", "typha angustifolia"), class = "factor"), n_i = structure(c(2L,
1L, 1L, 2L, 2L, 2L), .Label = c("i", "n"), class = "factor"),nat=structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("1", "2"), class = "factor"), impervious = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("1", "2"), class = "factor"), precip_variation = c(70.24882178, 70.24882178, 70.24882178, 21.92460821, 21.92460821, 18.90115299), po4 = c(-2.203425667,
-2.204119983, -2.20481541, -1.845271793, -1.844967771, -2.417936637), po4_variation = c(0.8011, 0.801, 0.8009, 0.4839, 0.484, 0.5229), nitrogen = c(0.00627, 0.00626, 0.00625, 0.00432, 0.00433, 0.01018), nitrogen_variation = c(0.7739, 0.7738, 0.7737, 0.5435, 0.5436, -0.1251), avg.height = c(99.1, 113.5559506, 191.4111012, 73.72222025, 35.42222025, 59.52222025), avg.culms = c(0.492915384, 0.78612011, 0.884606749, 0.96483549, 0.819543936, 0.831087338), avg.sla = c(179.3510333, 149.0332471, 68.77888941, 334.2177912, 798.7581389, 443.2005556), avg.chloro = c(0.900670513, 0.790832282, 0.965532685, 0.565585484, 1.106203493, 0.970209082)), .Names = c("site", "species", "n_i", "nat", "impervious", "precip_variation", "po4", "po4_variation", "nitrogen", "nitrogen_variation", "avg.height", "avg.culms", "avg.sla", "avg.chloro"), row.names = c(NA, 6L), class = "data.frame")
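
On the warning itself: it comes from R's if(), which expects a single TRUE or FALSE. deparse() splits a long formula across several strings, so grepl() returns one value per string and if() only looks at the first. The sketch below reproduces that mechanism in base R using one of the longer formulas from the question. It only illustrates where the message comes from, based on the line quoted in the warning; it does not show what piecewiseSEM does with the result, so it cannot by itself tell you whether the estimates are affected.

```r
# One of the longer model formulas from the question (no data needed to build it)
f <- avg.height ~ nat + impervious + po4 + po4_variation + nitrogen +
  nitrogen_variation + precip.variation + n_i

# deparse() breaks long expressions into several strings
pieces <- deparse(f)
length(pieces)

# grepl() then returns one logical per string, which is too many for if()
hits <- grepl("cbind", pieces)
length(hits)

# Collapsing to a single string first gives a length-one condition
one <- grepl("cbind", paste(pieces, collapse = " "))
one
```

Note that from R 4.2.0 onwards a condition of length greater than one in if() is an error rather than a warning.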

Match a pattern within any element of the data using data table rather than plyr

I have a very big data set and have not used data.table before. I am finding the syntax a bit difficult to follow. My main question is: how can I reproduce the 'apply' function for a data table?
My data is as follows
dat1 <- structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"), diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor")), .Names = c("id", "diag1", "diag2", "diag3"), row.names = c(NA, -4L), class = "data.frame")
I want to add a variable flagging all records that have a diagnostic code of I20, I21 or I60 in any of the columns diag1, diag2 or diag3. Using apply and a regex, I have done the following.
code.list <- c("I20","I21","I60")
dat1$index <- apply(dat1[2:4],1, function(i) any(grep(paste(code.list,
collapse="|"), i)))
The final dataset that I want is illustrated below
structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"),diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor"), index = c(TRUE, TRUE, FALSE, TRUE)), .Names = c("id","diag1", "diag2", "diag3", "index"), row.names = c(NA, -4L), class = "data.frame")
However this is going to take far too long using plyr. I was hoping to get the syntax for a data table. Would anybody be able to help?
Thanks in advance
A
We can do this with data.table
library(data.table)
setDT(dat1)[, index := Reduce(`|`, lapply(.SD, grepl,
pattern = paste(code.list, collapse="|"))), .SDcols = 2:4]
dat1
#    id diag1 diag2 diag3 index
# 1:  1 I20.1 I60.9        TRUE
# 2:  1 I21.3   I50 I38.1  TRUE
# 3:  2   I48             FALSE
# 4:  3 I60.8              TRUE
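
The key idea in the answer above is to run grepl once per column (vectorised over all rows) and then combine the resulting logical vectors with |, rather than calling a function once per row as apply does. The same idea can be sketched in base R without data.table; the data below mirror the question's dat1 but are entered as plain character vectors for brevity.

```r
# Same values as the question's dat1, as character columns
dat1 <- data.frame(
  id    = c(1L, 1L, 2L, 3L),
  diag1 = c("I20.1", "I21.3", "I48", "I60.8"),
  diag2 = c("I60.9", "I50", "", ""),
  diag3 = c("", "I38.1", "", ""),
  stringsAsFactors = FALSE
)
code.list <- c("I20", "I21", "I60")
pattern <- paste(code.list, collapse = "|")

# One logical vector per diagnosis column
per_column <- lapply(dat1[c("diag1", "diag2", "diag3")], grepl, pattern = pattern)

# Element-wise OR across the columns: TRUE if any column matches in that row
dat1$index <- Reduce(`|`, per_column)
dat1
```

The pattern matches the codes anywhere in the string; if the codes must appear at the start, a pattern such as "^(I20|I21|I60)" would be stricter.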

R Wide to long format for multiple variables with patterns [duplicate]

I have a data set with a single identifier and five columns that repeat 18 times. I want to restructure the data into long format keeping the first five column headings as the column headings. Below is a sample with just two repeats:
structure(list(Response.ID = 1:2, Task = structure(c(1L, 1L), .Label = "task1", class = "factor"),
Freq = structure(c(1L, 1L), .Label = "Daily", class = "factor"),
Hours = c(3L, 2L), Value = c(10L, 8L), Mood = structure(1:2, .Label = c("Engaged",
"Neutral"), class = "factor"), Task.1 = structure(c(1L, 1L
), .Label = "task2", class = "factor"), Freq.1 = structure(c(1L,
1L), .Label = "Weekly", class = "factor"), Hours.1 = c(4L,
4L), Value.1 = c(10L, 6L), Mood.1 = structure(c(2L, 1L), .Label = c("Neutral",
"Optimistic"), class = "factor")), .Names = c("Response.ID", "Task", "Freq", "Hours", "Value", "Mood", "Task.1", "Freq.1", "Hours.1", "Value.1", "Mood.1"), class = "data.frame", row.names = c(NA, -2L))
I attempted using the melt and patterns functions, which appears to approximate my desired outcome without the desired column headings:
df = melt(df1, id.vars = c("Response.ID"), measure.vars = patterns("^Task", "^Freq","^Hours","^Mood"))
Here is the result:
structure(list(Response.ID = c(1L, 2L, 1L, 2L), variable = structure(c(1L, 1L, 2L, 2L), class = "factor", .Label = c("1", "2")), value1 = c("task1", "task1", "task2", "task2"), value2 = c("Daily", "Daily", "Weekly", "Weekly"), value3 = c(3L, 2L, 4L, 4L), value4 = c("Engaged", "Neutral", "Optimistic", "Neutral")), .Names = c("Response.ID", "variable", "value1", "value2", "value3", "value4"), row.names = c(NA, -4L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000330788>)
When I tried to specify names with the value.name argument, as below, I received an error:
df = melt(df1, id.vars = c("Response.ID"),measure.vars = patterns("^Task", "^Freq","^Hours","^Mood"), value.name=c("Task", "Freq", "Hours", "Value","Mood"))
My desired result would look like this:
structure(list(Response.ID = c(1L, 2L, 1L, 2L), Task = structure(c(1L, 1L, 2L, 2L), .Label = c("task1", "task2"), class = "factor"),
Freq = structure(c(1L, 1L, 2L, 2L), .Label = c("Daily", "Weekly"
), class = "factor"), Hours = c(3L, 2L, 4L, 4L), Value = c(10L,
8L, 10L, 6L), Mood = structure(c(1L, 2L, 3L, 2L), .Label = c("Engaged",
"Neutral", "Optimistic"), class = "factor")), .Names = c("Response.ID", "Task", "Freq", "Hours", "Value", "Mood"), class = "data.frame", row.names = c(NA, -4L))
It looks to me like you embarked on a difficult journey by using melt: this function is well named in the sense that trying to use it will probably melt your brain. Joke aside, the function melt has lots of underlying computations and its use could be inefficient if you have a large dataset.
I would instead solve the problem manually with rbindlist (from the excellent package data.table, which also ships with an optimized version of melt if you really want to use it) to concatenate the groups of columns one after another. This also preserves the column names:
> rbindlist(lapply(1:2, function(i) df1[,c(1,((i-1)*5+2):((i-1)*5+6))]))
   Response.ID  Task   Freq Hours Value       Mood
1:           1 task1  Daily     3    10    Engaged
2:           2 task1  Daily     2     8    Neutral
3:           1 task2 Weekly     4    10 Optimistic
4:           2 task2 Weekly     4     6    Neutral
This works on your example: replace the indices 1:2 by the number of repetitions to make it work with the real dataset (so, lapply(1:18)).
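
If you would rather stay with melt: the call in the question passes five names to value.name but only four patterns (there is no "^Value"), and data.table's melt expects one name per group of measure columns, which may be what triggers the error. Below is a minimal sketch, assuming data.table is installed, that rebuilds the two-repeat sample as a data.table (with character columns instead of factors) and supplies all five patterns.

```r
library(data.table)

# Two-repeat sample from the question, entered directly as a data.table
df1 <- data.table(
  Response.ID = 1:2,
  Task = "task1", Freq = "Daily", Hours = c(3L, 2L),
  Value = c(10L, 8L), Mood = c("Engaged", "Neutral"),
  Task.1 = "task2", Freq.1 = "Weekly", Hours.1 = c(4L, 4L),
  Value.1 = c(10L, 6L), Mood.1 = c("Optimistic", "Neutral")
)

# One pattern per output column, and one value.name per pattern
long <- melt(
  df1,
  id.vars      = "Response.ID",
  measure.vars = patterns("^Task", "^Freq", "^Hours", "^Value", "^Mood"),
  value.name   = c("Task", "Freq", "Hours", "Value", "Mood")
)

# 'variable' records which repeat each row came from; drop it if not needed
long[, variable := NULL]
long
```

For the full dataset the same call should pick up all 18 repeats, as long as every repeated column name starts with one of the five prefixes.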

how to remove duplicated strings and merge all columns strings in one?

I have data that looks like the following df
df<- structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("DNAJC11;FGOTG",
"MAPK14", "PPIB", "RBX1", "USP14"), class = "factor"), V2 = structure(c(4L,
3L, 2L, 1L, 1L), .Label = c("", "DNAJC9", "MAPK14", "USP14"), class = "factor"),
V3 = structure(c(3L, 2L, 4L, 5L, 1L), .Label = c("", "DNAJC11;FGOTG",
"GCLC", "GSR", "STIP1"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
I want to merge all columns into one and then keep the unique ones
for example the output should look like this
USP14
DNAJC11;FGOTG
MAPK14
PPIB
RBX1
DNAJC9
GCLC
GSR
STIP1
I tried to use the melt function but I could not figure out how to do this. Any comment is appreciated. Thanks
unique(as.vector(as.matrix(df)))
To remove the empty strings:
vec <- unique(as.vector(as.matrix(df)))
vec[vec != ""]
or, courtesy of @rawr
Filter(nzchar, unique(as.vector(as.matrix(df))))
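
A note on dropping empty strings: negative indexing with which() looks equivalent to logical indexing, but it misbehaves when there is nothing to drop. The short sketch below uses a made-up vector with no empty strings to show the difference.

```r
# Hypothetical vector that contains no empty strings
vec <- c("USP14", "MAPK14", "PPIB")

# which() returns integer(0) here, and vec[-integer(0)] selects nothing at all
dropped_by_index <- vec[-which(vec == "")]

# Logical indexing and nzchar() keep everything, as intended
dropped_by_logic  <- vec[vec != ""]
dropped_by_nzchar <- vec[nzchar(vec)]

length(dropped_by_index)
length(dropped_by_logic)
```

For that reason the logical form, or Filter(nzchar, ...), is the safer habit.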

change the names for certain columns in a data frame [duplicate]

If I want to change the names from the 2nd column to the end, why does my command not work?
fredTable <- structure(list(Symbol = structure(c(3L, 1L, 4L, 2L, 5L), .Label = c("CASACBM027SBOG",
"FRPACBW027SBOG", "TLAACBM027SBOG", "TOTBKCR", "USNIM"), class = "factor"),
Name = structure(1:5, .Label = c("bankAssets", "bankCash",
"bankCredWk", "bankFFRRPWk", "bankIntMargQtr"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Banks", class = "factor"),
Country = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "USA", class = "factor"),
Lead = structure(c(1L, 1L, 3L, 3L, 2L), .Label = c("Monthly",
"Quarterly", "Weekly"), class = "factor"), Freq = structure(c(2L,
1L, 3L, 3L, 4L), .Label = c("1947-01-01", "1973-01-01", "1973-01-03",
"1984-01-01"), class = "factor"), Start = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Current", class = "factor"), End = c(TRUE,
TRUE, TRUE, TRUE, FALSE), SeasAdj = c(FALSE, FALSE, FALSE,
FALSE, TRUE), Percent = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Fed", class = "factor"),
Source = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Res", class = "factor"),
Series = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Level",
"Ratio"), class = "factor")), .Names = c("Symbol", "Name",
"Category", "Country", "Lead", "Freq", "Start", "End", "SeasAdj",
"Percent", "Source", "Series"), row.names = c("1", "2", "3",
"4", "5"), class = "data.frame")
Then, to change the column names from the second column to the end, I use the following commands, but they do not work:
names(fredTable[,-1]) = paste("case", 1:ncol(fredTable[,-1]), sep = "")
or
names(fredTable)[,-1] = paste("case", 1:ncol(fredTable)[,-1], sep = "")
In general, how can one change the names of specific columns (for example 2 to the end, 2 to 7, etc.) and set them to whatever names one likes?
Replace specific column names by subsetting on the outside of the function, not within the names function as in your first attempt:
> names(fredTable)[-1] <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
Explanation
If we save the new names in a vector newnames we can investigate what is going on under the hood with replacement functions.
#These are the names that will replace the old names
newnames <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
We should always replace specific column names with the format:
#The right way to replace the second name only
names(df)[2] <- "newvalue"
#The wrong way
names(df[2]) <- "newvalue"
The problem is that you are attempting to create a new vector of column names then assign the output to the data frame. These two operations are simultaneously completed in the correct replacement.
The right way [Internal]
We can expand the function call with:
#We enter this:
names(fredTable)[-1] <- newnames
#This is carried out on the inside
fredTable <- `names<-`(fredTable, `[<-`(names(fredTable), -1, newnames))
The wrong way [Internal]
The internals of replacement the wrong way are like this:
#Wrong way
names(fredTable[-1]) <- newnames
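#Wrong way Internal (roughly)
fredTable <- `[<-`(fredTable, -1, value = `names<-`(fredTable[-1], newnames))
Notice that `names<-` is applied only to the temporary subset fredTable[-1], never to names(fredTable). The renamed subset is then written back into fredTable with `[<-`, which replaces the column contents but keeps the existing column names, so the names of fredTable do not change.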
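
To cover the "2 to end" and "2 to 7" cases from the question, the same pattern works with any index on the outside of names(). The sketch below uses a small made-up data frame (df and df2 are hypothetical) and renames a range of columns, then shows that subsetting inside names() leaves the original names as they were.

```r
# Hypothetical five-column data frame
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

# Columns 2 to the end
names(df)[-1] <- paste0("case", seq_len(ncol(df) - 1))

# A specific range only (here columns 2 to 3)
names(df)[2:3] <- c("first", "second")
names(df)

# Subsetting inside names() runs without error but does not rename anything
df2 <- data.frame(a = 1, b = 2, c = 3)
names(df2[-1]) <- c("x", "y")
names(df2)
```

The replacement vector must have the same length as the number of columns selected by the index.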
