Set first column as rowname, in spite of duplicates


sample
  Symbols  IDs Value1 Value2 Value3
1      NA   NA    3.1    2.3    1.7
2    TP53 1234    5.8    6.9   10.1
3    Kras 5678    0.1    0.3    0.5
4      NA   NA   10.3    2.1    7.9
5    Hras 9991   20.0   30.0   40.0
6    TP53 1234   -3.1    0.2    1.7
My table looks like this.
I need to calculate values by row instead of by column.
So I tried to use Symbols as the new row names. That way, I could pull out a whole row with sample["Hras", ].
When I tried to do this, I ran into the following problem:
rownames(sample)<-sample[,1]
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A1CF’, ‘A2M’, ‘A2ML1’, ‘AAGAB’, ‘AAK1’, ‘AAMDC’, ‘AARS2’, ‘AASDH’, ‘AASDHPPT’, ‘AASS’, ‘ABAT’, ‘ABCA1’, ‘ABCA13’, ‘ABCA2’, ‘ABCA4’, ‘ABCA5’, ‘ABCA8’, ‘ABCA9’, ‘ABCB1’, ‘ABCB11’, ‘ABCB4’, ‘ABCB5’, ‘ABCB6’, ‘ABCB8’, ‘ABCB9’, ‘ABCC1’, ‘ABCC10’, ‘ABCC11’, ‘ABCC12’, ‘ABCC13’, ‘ABCC3’, ‘ABCC4’, ‘ABCC5’, ‘ABCC6’, ‘ABCC8’, ‘ABCC9’, ‘ABCD3’, ‘ABCD4’, ‘ABCE1’, ‘ABCF2’, ‘ABCG1’, ‘ABHD1’, ‘ABHD10’, ‘ABHD11’, ‘ABHD12’, ‘ABHD13’, ‘ABHD17B’, ‘ABHD2’, ‘ABHD5’, ‘ABHD6’, ‘ABI1’, ‘ABI2’, ‘ABI3BP’, ‘ABL2’, ‘ABLIM1’, ‘ABLIM2’, ‘ABO’, ‘ABR’, ‘ABRA’, ‘ABTB1’, ‘ABTB2’, ‘ACAA1’, ‘ACAA2’, ‘ACACA’, ‘ACACB’, ‘ACAD10’, ‘ACADL’, ‘ACADSB’, ‘ACAN’, ‘ACAP1’, ‘ACAP2’, ‘ACAP3’, ‘ACAT1’, [... truncated]
Is this because of the "NA" values? Are there other options?
Thanks
This is a microarray dataset. I have normalized it and am going to extract the values of several genes for plotting, cross-correlation, and t-tests. In fact, not only NA but also several of the genes that I am going to plot have multiple rows. So I need to extract them into another table for later use.

Here I am only showing a way to change the row.names, as requested in the question; the ultimate goal is not clear. For the specific problem, you could try make.names with unique = TRUE, which makes sure that duplicates are named differently. The NA values in the first column will be renamed NA., NA..1, etc. (if that is okay for you).
row.names(sample) <- make.names(sample[,1],TRUE)
Or, as commented by @Richard Scriven,
row.names(sample) <- paste(make.unique(sample[,1]))
Another option would be to convert the data.frame to a matrix (which permits duplicate row names). I would recommend this only if the columns are all of the same class. For example, if you have character and numeric columns, this will convert all the columns to character class. In your dataset, it seems to me that all columns except the first are numeric (with the possible exception of the "IDs" column). But the NA values would still be a problem, and with duplicated row names it is difficult to pick out a particular row (say, the first or the third) by name.
sample1 <- as.matrix(sample[,-1])
row.names(sample1) <- sample[,1]
sample1['Hras',]
# IDs Value1 Value2 Value3
# 9991 20 30 40
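As a rough sketch of the make.names approach on the example data (the data frame below is a hand-typed reconstruction of the table in the question, and filtering on the Symbols column is an assumption about what is wanted for genes with several rows, not something the question spells out):

```r
# Hypothetical reconstruction of the example table from the question
sample <- data.frame(
  Symbols = c(NA, "TP53", "Kras", NA, "Hras", "TP53"),
  IDs     = c(NA, 1234, 5678, NA, 9991, 1234),
  Value1  = c(3.1, 5.8, 0.1, 10.3, 20.0, -3.1),
  Value2  = c(2.3, 6.9, 0.3, 2.1, 30.0, 0.2),
  Value3  = c(1.7, 10.1, 0.5, 7.9, 40.0, 1.7),
  stringsAsFactors = FALSE
)

# make.names() turns NA into a valid name and, with unique = TRUE,
# appends a suffix to every repeated name
row.names(sample) <- make.names(sample$Symbols, unique = TRUE)

# A gene without duplicates can now be addressed by row name
hras <- sample["Hras", ]

# For a gene with several rows, match on the column instead of the row name;
# which() drops the NA comparisons that would otherwise add all-NA rows
tp53 <- sample[which(sample$Symbols == "TP53"), ]
```

Here tp53 keeps both TP53 rows (named TP53 and TP53.1), which may be closer to the stated goal of extracting every row of a gene into another table.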

Related

How can I append a dataframe and add a row with repeated measure column info within the same dataframe

I need an idea of how I can duplicate several fields in a row of data in a df, with the exception of the species (spec) and the other length measures, and append that row of information and length to the df.
len1 is the length of each specimen, but I need to convert each length measure to its own row in the dataframe while duplicating the other measures (sal, DO, temp, mo, year). I know I can convert empty length fields to "." to help with coding, but any suggestion on a starting point or coding direction would be greatly appreciated.
I am just getting back into really using R for work instead of grad school, so I'm a little rusty, but getting there. This is my first time using Stack Overflow, so I apologize if I'm not following some norms.
I'm starting to get familiar with dplyr and reshape, but any libraries or tutorials for something like this are greatly appreciated.
year mo temp sal  DO   spec len1 len2 len3
2019 1  15   7.2  8.31 ooo
2019 1  15.5 5.2  8.75 atc  175
2019 1  15.5 5.2  8.75 cfc  135  156
2019 1  15.5 5.2  8.75 men  181  206  174
For the example data above, I am trying to get the second length for cfc moved to len1 on a new row in the dataframe:
year mo temp sal DO spec len1
2019 1 15.5 5.2 8.75 cfc 156

You can use gather from tidyr. This converts the column names into values in a new column.
For example, assuming your data is in a dataframe called df:
library(tidyverse)
df_long <- df %>%
gather( key = "sampleID", # name for new column that will contain "len1", "len2", "len3"
value = "length", # name for new column that will contain length values
len1:len3 # columns to include in the process
)
At this point you could drop the "sampleID" column if you wish, for example with
df_long %>% select(-sampleID)
# or other equivalent approaches
select(df_long, -sampleID) # same command using dplyr without the pipe
df_long$sampleID <- NULL # base R approach
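For comparison, the same reshaping can be sketched in base R without any packages. This is an alternative to the gather() call, not the answer's own method; the data frame is a hand-typed copy of the example, with empty length fields assumed to be NA.

```r
# Hypothetical reconstruction of the example data (empty lengths as NA)
df <- data.frame(
  year = 2019, mo = 1,
  temp = c(15, 15.5, 15.5, 15.5),
  sal  = c(7.2, 5.2, 5.2, 5.2),
  DO   = c(8.31, 8.75, 8.75, 8.75),
  spec = c("ooo", "atc", "cfc", "men"),
  len1 = c(NA, 175, 135, 181),
  len2 = c(NA, NA, 156, 206),
  len3 = c(NA, NA, NA, 174),
  stringsAsFactors = FALSE
)

len_cols <- c("len1", "len2", "len3")
id_cols  <- setdiff(names(df), len_cols)

# Repeat the identifying columns once per length column, then stack the
# length columns one after another so the two pieces line up row by row
long <- data.frame(
  df[rep(seq_len(nrow(df)), times = length(len_cols)), id_cols],
  length = unlist(df[len_cols], use.names = FALSE),
  row.names = NULL
)

# Keep only the rows that actually carry a measurement
long <- long[!is.na(long$length), ]
```

Note that the final filter also drops the ooo row, which has no lengths at all; skip that step if such rows should be kept.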

R subset exclusion based on string creates extra column

I have a data set like the one below:
salaries <- read.csv('salaries.csv', header=TRUE)
print(salaries)
Name Job Salary CompanyExperience IndustryExperience
John Engineer 50000 3 12
Adam Manager 55000 6 7
Alice Manager #N/A 6 6
Bob Engineer 65000 5 #N/A
Carl Engineer 70000 #N/A 10
I would like to plot some of this information, specifically Salary ~ CompanyExperience. To do that, I need to exclude any data points with "#N/A" by removing every row that contains the "#N/A" text string (produced by an MS Excel spreadsheet exported to CSV).
My code to subset is as follows:
salaries <-salaries[salaries$CompanyExperience!="#N/A" &
salaries$Salary!="#N/A",]
#write.csv(salaries, "salaries2.csv")
#salaries <- read.csv('salaries2.csv', header=TRUE)
print(salaries)
Now this seems to work without any issue, producing:
Name Job Salary CompanyExperience IndustryExperience
1 John Engineer 50000 3 12
2 Adam Manager 55000 6 7
4 Bob Engineer 65000 5 #N/A
Which seems fine, however as soon as I try to put this data subset into a linear regression, I get an error:
> salarylinear <- lm(salaries$CompanyExperience ~ salaries$Salary)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
I've done some experimenting and found that if I subset the data using conditions like "!=10000" or "<50", I don't get this error. I've also found that when I write this new subset to a CSV file and read it back in (by removing the # marks in the code above), the data set gains a mysterious "X" column at the front and no longer produces the error when I run a linear regression:
X Name Job Salary CompanyExperience IndustryExperience
1 1 John Engineer 50000 3 12
2 2 Adam Manager 55000 6 7
3 4 Bob Engineer 65000 5 #N/A
I've searched the web and can't find any reason why this is happening. Is there a way I can produce a usable subset by excluding "#N/A" strings without having to write the data to disk and read it back into memory?

Most likely what is happening is that columns of data that you think are numeric are not in fact numeric. Two things are leading to this:
1. read.csv() doesn't know that "#N/A" means "missing", so it reads "#N/A" in as a string (not a number). As a result, it treats the whole Salary, CompanyExperience, and IndustryExperience columns as string variables.
2. read.csv() has a notorious default (in R versions before 4.0.0) of reading strings in as factors. If you're unfamiliar with factors, they are worth reading up on.
This combination of events is why lm() thinks your dependent variable is a factor and is throwing an error.
The solution is to add na.strings = "#N/A" as an argument to read.csv(). Then your data will be read in as numeric. You can proceed straight to running your regression because lm() will drop rows with NA's automatically.
However, to be a bit more explicit, you may also want to add stringsAsFactors = FALSE as an argument to read.csv(), just in case anything else that means "missing" is coded as, say, a blank. And if you want to handle the NAs manually before running your regression, you can drop rows with NAs using complete.cases() or something like salaries[!is.na(salaries$Salary), ]
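A small sketch of that fix, reading the example data from an inline string instead of salaries.csv so that it runs on its own (the comma-separated layout is an assumption about the file; the formula is written with data = salaries rather than salaries$... purely for readability):

```r
# Inline stand-in for salaries.csv (assumed to be comma-separated)
csv_text <- "Name,Job,Salary,CompanyExperience,IndustryExperience
John,Engineer,50000,3,12
Adam,Manager,55000,6,7
Alice,Manager,#N/A,6,6
Bob,Engineer,65000,5,#N/A
Carl,Engineer,70000,#N/A,10"

# na.strings tells read.csv() to treat "#N/A" as missing, so the
# numeric columns are read as numbers instead of strings or factors
salaries <- read.csv(text = csv_text, header = TRUE,
                     na.strings = "#N/A", stringsAsFactors = FALSE)

# Optional manual step: keep rows where both model variables are present
used <- salaries[complete.cases(salaries[, c("Salary", "CompanyExperience")]), ]

# lm() drops rows with NA in the model variables on its own
fit <- lm(CompanyExperience ~ Salary, data = salaries)
```

With this data, used holds the rows for John, Adam, and Bob, which are the same three observations that lm() ends up fitting.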

Follow-up to our discussion in the comments about what happens when you subset a data.frame with a matrix:
First, we create a 3x2 dataframe to work with:
df <- data.frame(x=1:3, y=4:6)
Then, let's create a vector of TRUE/FALSE for the rows we want to keep when we subset our dataframe.
v <- c(T,T,F)
Here, v has 2 TRUEs followed by 1 FALSE so if we subset our 3-row dataframe with v, we will be selecting the first 2 rows and omitting the 3rd row:
df[v,]
x y
1 1 4
2 2 5
Great, that works as expected. But what about if we subset with a matrix? We create matrix m that has the same 3x2 dimensions as our dataframe. m is full of TRUEs except for 2 FALSEs in cells (1,1) and (3,2).
m <- matrix(c(F,T,T,T,T,F), ncol=2)
m
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
[3,] TRUE FALSE
Now, if we try to subset our dataframe with m, we might at first think that we're going to get only row 2 back, because m has a FALSE in its first and third rows. That, of course, isn't what happens.
df[m,]
x y
2 2 5
3 3 6
NA NA NA
NA.1 NA NA
The trick to understanding this is to know that a matrix in R is just a vector with a dimension attribute. The dimension is as expected, because we created m:
dim(m)
[1] 3 2
But as a vector, what does m look like:
as.vector(m)
[1] FALSE TRUE TRUE TRUE TRUE FALSE
We see that m-as-a-vector is just the columns of m, repeated one after the other (because R "fills in" matrices column-wise). Let me re-write m with the original cells identified, in case my description isn't clear:
[1] FALSE TRUE TRUE TRUE TRUE FALSE
(1,1) (2,1) (3,1) (1,2) (2,2) (3,2)
So when we try to subset our dataframe with m, it's like using this length-6 vector, and this length-6 vector says to select rows 2:5. So when we write df[m, ] R faithfully selects rows 2 and 3, and then when it tries to select rows 4 and 5, they don't "exist" so R fills them in with NAs. This is why we get more rows in our subset than in our original dataframe.
Lastly, we saw that df[m, ] has funny rownames like NA.1. Rownames must be unique, so R calls the row 4 of the "subset" 'NA' and it calls row 5 of the subset 'NA.1'.
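To check the explanation above, and to get the result one might have expected in the first place, here is a short sketch. Collapsing m with apply(m, 1, all) is one possible way to reduce the matrix to a single TRUE/FALSE per row; it is an addition here, not part of the original discussion.

```r
df <- data.frame(x = 1:3, y = 4:6)
m  <- matrix(c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE), ncol = 2)

# Positions picked when m is treated as a plain length-6 logical vector
selected <- which(as.vector(m))

# Collapse the matrix to one TRUE/FALSE per row before subsetting:
# a row is kept only if every cell in that row of m is TRUE
keep <- apply(m, 1, all)
rows_all_true <- df[keep, ]
```

Here selected contains positions 2 to 5, matching the rows R tried to fetch, while rows_all_true contains only row 2 of df.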
I hope this clears it up for you. Happy coding!

Replacing N/A with other value in data.table returns error

I want to replace NA with another value for a specific column in a data.table.
I tried the approach from the question linked below, but I get an error.
How to replace NA values in a table for selected columns? data.frame, data.table
The code I use is:
df<-data.table(aa<-(1:4),ba<-c(NA,1,3,4),ca<-c(NA,"2012-01-02","2012-02-02","2012-03-02"))
df[is.na(get(ca)),(ca):="2012-04-01"]
I got the error message: Error in get(c) : object 'NA' not found
But if I use
df[is.na(ca),(ca):="2012-04-01"]
It returns results that I don't want.
Can anyone help me?
Thanks

If we use the correct column names, it would work, and we don't need get.
df[is.na(ca), ca:= "2012-04-01"]
df
# aa ba ca
#1: 1 NA 2012-04-01
#2: 2 1 2012-01-02
#3: 3 3 2012-02-02
#4: 4 4 2012-03-02
Within the data.table() call, we use = instead of <- so that the columns are actually named aa, ba, and ca. In addition, as @Frank mentioned, assigning to (ca) and to ca are different: in the former, ca is evaluated and could be a vector of strings used to create the names of new columns.
data
library(data.table)
df<-data.table(aa=(1:4),ba=c(NA,1,3,4),
ca=c(NA,"2012-01-02","2012-02-02","2012-03-02"))

Rbind() doesn't work with character data with different names

I have tried to add a row to an existing dataset which I read into R from a csv file.
The dataset looks like this:
Format PctShare
1 NewsTalk 12.6
2 Country 12.5
3 AdultContemp 8.2
4 PopHit 5.9
5 ClassicRock 4.7
6 ClassicHit 3.9
7 RhythmicHit 3.7
8 UrbanAdult 3.6
9 HotAdult 3.5
10 UrbanContemp 3.3
11 Mexican 2.9
12 AllSports 2.5
After naming the dataset "share", I tried to add a 13th row to it by using this code:
totalshare <- rbind(share, c("Others", 32.7))
--> which didn't work and gave me this warning message:
Warning message: In `[<-.factor`(`*tmp*`, ri, value = "Others") : invalid factor level, NA generated
However, when I tried entering a row with an existing character value ("AllSports") in the dataset with this code:
rbind(share, c("AllSports", 32.7))
--> it added the row perfectly
I am wondering whether I need to tell R that there is a new character value in the "Format" column before I bind the new row to the data frame?

Your Format column is a factor variable. Look at str(share), str(share$Format), class(share$Format) and levels(share$Format) for more information. The reason rbind(share, c("AllSports", 32.7)) worked is that "AllSports" is already an existing factor level for the Format variable.
To fix the issue, convert the Format column to character via:
share$Format <- as.character(share$Format)
Do some searches on factor variables and setting factor levels to learn more. Moreover, when you are reading in the file from csv, you can prevent character strings from being converted to factors with the option stringsAsFactors = FALSE, for example share <- read.csv("myfile.csv", stringsAsFactors = FALSE).
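As a hedged illustration of the fix, the sketch below uses a three-row stand-in for share with Format deliberately stored as a factor. Binding a one-row data frame instead of c("Others", 32.7) is an extra suggestion beyond the original answer: c() would coerce 32.7 to the string "32.7", and the data frame form keeps PctShare numeric.

```r
# Small stand-in for the data set in the question, with Format as a factor
share <- data.frame(
  Format   = c("NewsTalk", "Country", "AllSports"),
  PctShare = c(12.6, 12.5, 2.5),
  stringsAsFactors = TRUE
)

# Convert the factor column to character so new values are accepted
share$Format <- as.character(share$Format)

# Bind a one-row data frame so each column keeps its own type
totalshare <- rbind(share, data.frame(Format = "Others", PctShare = 32.7))
```

After this, totalshare has four rows, the last one being "Others", and PctShare can still be summed or plotted as a number.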

I have two solutions in mind.
Solution 1:
Before reading the data, set
options(stringsAsFactors = FALSE)
or
Solution 2:
as suggested by @JasonAizkalns

Calculate a value in a column for each row

Here is the table that I am trying to manipulate:
colnames sampA sampB
#1 conA conB
#2 1.1 4.4
#3 2.2 5.5
#4 3.3 6.6
I want to calculate log2(x(1-x)) for each number in $sampB. Here is my code so far:
DF[-1,3] <- apply(DF[-1,]$sampB,1,function(x) log2(x(1-x)))
then I got the error message:
dim(X) must have a positive length

You shouldn't need apply(), as log2() is vectorized. Try this
x <- as.numeric(as.character(DF$sampB[-1]))
log2(x * (1 - x))
I took off the first element because I'm not really sure what that conB part is about (and now you have confirmed it in the comments). I also suspect that the column might be a factor (because of conB), so I wrapped the column in as.numeric(as.character(...)). That may not be necessary, but better safe than sorry.
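To see why the as.numeric(as.character(...)) wrapping matters, here is a minimal sketch. The factor f is a made-up stand-in for DF$sampB, and p is a hypothetical vector of values between 0 and 1 chosen only to show the vectorized call.

```r
# Stand-in for DF$sampB when it has been read in as a factor
f <- factor(c("conB", "4.4", "5.5", "6.6"))

# as.numeric() on a factor returns the internal level codes ...
codes <- as.numeric(f[-1])

# ... so convert to character first to recover the actual numbers
values <- as.numeric(as.character(f[-1]))

# log2() is vectorized, so no apply() is needed (hypothetical inputs)
p <- c(0.1, 0.5, 0.9)
result <- log2(p * (1 - p))
```

One caveat: for values greater than 1, such as those in sampB, x * (1 - x) is negative, so log2() returns NaN with a warning.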
