I am new to R and am trying to run a differential gene expression analysis.
I want to store the gene names as row names so that I can create a DGEList object.
asthma <- read.csv("Asthma_3 groups-Our study gene expression.csv")
head(asthma, 10)
dim(asthma)
asthma <- na.omit(asthma)
asthma <- dplyr::distinct(asthma)  # assign the result; this only drops fully duplicated rows
countdata <- asthma[,-1]
head(countdata)
rownames(countdata) <- asthma[,1]
I am getting this error:
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
The first column in asthma likely has duplicate values. There are two options I can think of:
Can the first column be combined with another column to generate a new column with unique values that can be used as the rownames?
If not, you can probably use make.names().
Here is a reproducible example.
df = data.frame(col1 = c('A', 'A', 'B'), col2 = c(1, 2, 3))
df
That defines a data.frame that looks like this
col1 col2
1 A 1
2 A 2
3 B 3
The data.frame by default has rownames 1, 2, 3. If you try this
rownames(df) = df[,1]
you get an error because df[,1] contains 'A' twice, so it cannot be used as row names without modification. You can use make.names to create row names with unique values like this
unique.col1 = make.names(df[,1], unique = TRUE)
unique.col1
This results in
"A" "A.1" "B"
Note that the .1 was added to the second A to make it different from the first A. Then define the rownames as unique.col1:
rownames(df) = unique.col1
df
The data.frame df now looks like this
col1 col2
A A 1
A.1 A 2
B B 3
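The first option mentioned above (combining the first column with another column) can be sketched as follows. This is only an illustration on the same toy data; the names dup_values and new_names are made up for the example, and whether pasting two columns together gives meaningful gene identifiers depends on your data.

```r
# Same toy data as in the example above
df <- data.frame(col1 = c("A", "A", "B"), col2 = c(1, 2, 3))

# Which values of the first column occur more than once?
dup_values <- unique(df$col1[duplicated(df$col1)])
dup_values

# Build candidate row names from both columns (hypothetical naming scheme)
new_names <- paste(df$col1, df$col2, sep = "_")

# Only assign them if they really are unique
stopifnot(anyDuplicated(new_names) == 0)
rownames(df) <- new_names
df
```

Checking duplicated() first also tells you which genes are repeated, which may matter before deciding how to handle them.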
I have a dataset in RStudio in which some columns contain lists. Here is an example where columns "a" and "c" contain a list in each row.
What am I looking for?
I need to create a new column that collects the unique values from columns a, b and c and skips NA or NULL values.
The expected result is the column "desired_result".
test <- tibble(a = list(c("x1","x2"), c("x1","x3"),"x3"),
b = c("x1", NA,NA),
c = list(c("x1","x4"),"x4","x2"),
desired_result = list(c("x1","x2","x4"),c("x1","x3","x4"),c("x2","x3")))
What have I tried so far?
I tried the following, but it does not produce the expected result shown in column "desired_result":
test$attempt_1_ <-lapply(apply((test[, c("a","b","c"), drop = T]),
MARGIN = 1, FUN= c, use.names= FALSE),unique)
We may use pmap to loop over each of the corresponding elements of 'a' to 'c', remove the NA (na.omit) and get the unique values to store as a list in 'desired_result'
library(dplyr)
library(purrr)
test <- test %>%
mutate(desired_result2 = pmap(across(a:c), ~ sort(unique(na.omit(c(...))))))
Checking against the OP's expected result:
> all.equal(test$desired_result, test$desired_result2)
[1] TRUE
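If you would rather not depend on purrr, the same row-wise idea can be sketched in base R with Map. This is a hedged alternative, not the answer's method; combine_unique and c_col are names invented for the illustration, and the columns are held as plain vectors and lists rather than in a tibble.

```r
# Same input columns as in the question, as plain R objects
a <- list(c("x1", "x2"), c("x1", "x3"), "x3")
b <- c("x1", NA, NA)
c_col <- list(c("x1", "x4"), "x4", "x2")

# Combine the i-th element of each column, drop NAs,
# and keep the sorted unique values
combine_unique <- function(...) {
  x <- unlist(list(...))
  sort(unique(x[!is.na(x)]))
}

# Map walks over a, b and c_col in parallel, one "row" at a time
result <- unname(Map(combine_unique, a, b, c_col))
str(result)
```

The resulting list can be assigned to a new column of the tibble in the usual way, for example test$desired_result2 <- result.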
I have a series of character vectors in which each participant (denoted in the reprex by a letter) is followed by one or more pairs of a time point (in the reprex either 1 or 2) and a score. Here is the reprex:
l <- c("A","1","27","B","1","26","2","54")
How can I reshape the vector to create a dataframe that has three columns, with Column A as participant, Column B as Time Point, and Column C as Score?
The intended output would look something like this:
data.frame("Participant" = c("A","B","B"),
"Time Point" = c("1","1","2"),
"Score" = c("27","26","54"))
If easier to make, it could be brought into this shape:
data.frame("Participant" = c("A","B"),
"TimePoint1" = c("27","26"),
"TimePoint2" = c("NA","54"))
Any direction/thoughts are appreciated.
Here is one way in base R.
Based on a pattern in the participant names, we can find their positions using grep. In the example shared, the pattern is that every participant is an upper-case letter. We use those positions to split the data so that each participant gets their own list element. Within each element, the first value is the participant name and the remaining values alternate between Time.Point and Score.
output <- do.call(rbind, lapply(split(l,
findInterval(seq_along(l), grep('[A-Z]', l))), function(x) {
data.frame(Participant = x[1],
Time.Point = x[-1][c(TRUE, FALSE)],
Score = x[-1][c(FALSE, TRUE)])
}))
rownames(output) <- NULL
output <- type.convert(output, as.is = TRUE)  # as.is = TRUE keeps strings as character
output
# Participant Time.Point Score
#1 A 1 27
#2 B 1 26
#3 B 2 54
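The question also mentions a wide shape with one column per time point. One possible way to get there from the long result is base R's reshape(); this is a sketch under the assumption that the long data frame looks like the output above, and the names long and wide are illustrative.

```r
# Long-format result, recreated here so the example is self-contained
long <- data.frame(
  Participant = c("A", "B", "B"),
  Time.Point  = c(1L, 1L, 2L),
  Score       = c(27L, 26L, 54L)
)

# One row per participant, one Score column per time point;
# missing combinations are filled with NA
wide <- reshape(long,
                idvar     = "Participant",
                timevar   = "Time.Point",
                direction = "wide")
rownames(wide) <- NULL
wide
```

The new columns are named Score.1 and Score.2, and participant A gets NA for the second time point.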
I'm sorry for the basic question. I'm just struggling with something that should be simple. Say I have the data frame "Test" that originally has three fields: Col1, Col2, Col3.
I want to create new columns based on each of the original columns. The values in each row of the new columns would specify whether the corresponding value in the matching row on the original column is above or below the initial column's median. So, for example, in the image attached, Col4 is based on Col1. Col5 is based on Col2. Col6 based on Col3.
test dataframe example:
It's quite easy to perform this function on a single column and output a single column:
Test <- Test %>% mutate(Col4 = derivedFactor(
"below" = Col1 < median(Test$Col1),
"at" = Col1 == median(Test$Col1),
"above" = Col1 > median(Test$Col1),
.default = NA)
)
But if I'm performing this same operation over 50 columns, writing out/copy-paste and editing the code can be tedious and inefficient. I should mention that I am hoping to add the new columns to the data frame, not create another data frame. Additionally, there are about 200 other fields in the data frame that will not have this function performed on them (so I can't just use a mutate_all). And the columns are not uniformly named (my examples above are just examples, not the actual dataset) so I'm not able to find a pattern for mutate_at. Maybe there is a way to manually pass a list of column names to the mutate command?
There must be an easy and elegant way to do this. If anyone could help, that would be amazing.
You can do the following using data.table.
Firstly, I define a function which is applied onto a numeric vector, whereby it outputs the elements' corresponding position in relation to the vector's median:
med_fn = function(x){
med = median(x)
unlist(sapply(x, function(x){
if(x > med) {'Above'}
else if(x < med) {'Below'}
else {'At'}
}))
}
> med_fn(c(1,2,3))
[1] "Below" "At" "Above"
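As a side note, the element-by-element sapply is not strictly needed. A vectorised variant can be sketched as below; med_fn_vec is a hypothetical name, and the sketch assumes the input is numeric and contains no NA values (with NAs, median() itself returns NA unless na.rm = TRUE is used).

```r
# Vectorised sketch: sign(x - med) is -1, 0 or 1,
# so adding 2 gives an index into the three labels
med_fn_vec <- function(x) {
  med <- median(x)
  c("Below", "At", "Above")[sign(x - med) + 2]
}

med_fn_vec(c(1, 2, 3))
med_fn_vec(c(10, 40, 20, 20))
```

On long columns this avoids calling an R function once per element.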
Let us examine some sample data:
library(data.table)
dt = data.table(
C1 = c(1, 2, 3),
C2 = c(2, 1, 3),
C3 = c(3, 2, 1)
)
old = c('C1', 'C2', 'C3') # Name of columns I want to perform operation on
new = paste0(old, '_medfn') # Name of new columns following operation
Using the .SD and .SDcols arguments from data.table, I apply med_fn across the columns old, in my case columns C1, C2 and C3. I call the new columns C#_medfn:
dt[, (new) := lapply(.SD, med_fn), .SDcols = old]
Result:
> dt
C1 C2 C3 C1_medfn C2_medfn C3_medfn
1: 1 2 3 Below At Above
2: 2 1 2 At Below At
3: 3 3 1 Above Above Below
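If the data lives in a plain data.frame and you would rather not convert it, roughly the same pattern works in base R: pass a character vector of column names and assign the results to the new names. This is a sketch on the same sample values, with med_fn rewritten using ifelse so that the block is self-contained; other columns in the data frame are left untouched.

```r
# Label each value relative to the median of its own column
med_fn <- function(x) {
  med <- median(x)
  ifelse(x > med, "Above", ifelse(x < med, "Below", "At"))
}

df <- data.frame(
  C1 = c(1, 2, 3),
  C2 = c(2, 1, 3),
  C3 = c(3, 2, 1)
)

old <- c("C1", "C2", "C3")     # columns to process, listed by hand
new <- paste0(old, "_medfn")   # names of the derived columns

# lapply returns one character vector per selected column;
# assigning to df[new] appends them to the existing data frame
df[new] <- lapply(df[old], med_fn)
df
```

Because old is just a character vector, it can hold any hand-picked set of column names, which is the situation described in the question.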
How can I store the output of sapply() in a data frame where the index (name) is stored in the first column and its value in the corresponding second column? For illustration, I have shown only 2 elements here, but there are 110 columns in my data. "loan" is the data frame.
cols <- sapply(loan,function(x) sum(is.na(x)))
cols
       id member_id
        0         7
I want output as:
var value
id 0
member_id 7
I know that sapply() returns a vector, but when I print the vector, the values are printed along with an "index", e.g. the column name when it is applied to a data frame. How can I store it as a data frame with two columns, where the first column contains the index part and the second column contains the value?
I found an answer to my question. For those who actually did understand my problem, this answer might make sense:
cols <- data.frame(sapply(loan ,function(x) sum(is.na(x))))
cols <- cbind(variable = row.names(cols), cols)
I wanted the row.names to be in a column of the same data frame corresponding to the values obtained from sapply.
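A slightly more direct way to get the same two-column layout is to build the data frame from the names and the values of the named vector. The sketch below uses a small made-up loan data frame (the real one is not shown in the question), and na_counts and result are illustrative names.

```r
# Hypothetical stand-in for the real 'loan' data
loan <- data.frame(
  id        = c(1, 2, 3),
  member_id = c(NA, 10, NA)
)

# sapply returns a named integer vector: names are the column names
na_counts <- sapply(loan, function(x) sum(is.na(x)))

# Put the names in one column and the counts in another
result <- data.frame(
  var   = names(na_counts),
  value = unname(na_counts)
)
result
```

This gives clean column names (var and value) without a separate cbind step.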
We can use stack
stack(mylist)[2:1]
data
mylist <- list(df = 1, rf = 2)
Is this what you want?
Your original list:
L <- c("df",1,"rf",2)
L
[1] "df" "1" "rf" "2"
As a data frame:
N <- length(L)
df <- data.frame( var = L[seq(1,N,2)], value = L[seq(2,N,2)] )
df
var value
1 df 1
2 rf 2
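When the vector strictly alternates name, value, name, value, another way to sketch the same conversion is to fill a two-column matrix row by row. This assumes the vector has even length and that the values are meant to be numeric; m is an illustrative name.

```r
L <- c("df", "1", "rf", "2")

# Fill by row so that each row holds one (name, value) pair
m <- matrix(L, ncol = 2, byrow = TRUE)

df <- data.frame(
  var   = m[, 1],
  value = as.numeric(m[, 2])  # values arrive as character strings
)
df
```

The as.numeric() call is optional; drop it if the values should stay as text.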
I have a data frame, df2, containing observations grouped by an ID factor, that I would like to subset. I have used another function to identify which row within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level given in ID, not within the whole data frame df2. I'm looking for a way to select the rows for each ID according to the right index (that is, their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be combined with rbind at the end. Each subset of the data (one per value of ID) is filtered to the row given by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table, suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
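For completeness, here is a merge-free base R sketch that builds on the same ave idea: compute each row's position within its ID group, look up the wanted position for that ID with match, and keep the rows where the two agree. The names idx_in_group and wanted are invented for the illustration.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 pos = c(1, 3, 2))
df2 <- data.frame(ID = rep(c("A", "B", "C"), each = 5),
                  obs = 1:15)

# Position of every row within its own ID group (1, 2, 3, ...)
idx_in_group <- ave(seq_along(df2$ID), df2$ID, FUN = seq_along)

# The position requested for each row's ID (NA if the ID is not in df)
wanted <- df$pos[match(df2$ID, df$ID)]

# which() silently drops the NA comparisons
df3 <- df2[which(idx_in_group == wanted), ]
rownames(df3) <- NULL
df3
```

Unlike the sequence(table(...)) shortcut, this does not require df2 to be sorted by ID.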