rbind() doesn't work with character data with different names

I have tried to add a row to an existing dataset which I read into R from a csv file.
The dataset looks like this:
Format PctShare
1 NewsTalk 12.6
2 Country 12.5
3 AdultContemp 8.2
4 PopHit 5.9
5 ClassicRock 4.7
6 ClassicHit 3.9
7 RhythmicHit 3.7
8 UrbanAdult 3.6
9 HotAdult 3.5
10 UrbanContemp 3.3
11 Mexican 2.9
12 AllSports 2.5
After naming the dataset "share", I tried to add a 13th row to it with this code:
totalshare <- rbind(share, c("Others", 32.7))
This didn't work and gave me this warning message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "Others") : invalid factor level, NA generated
However, when I tried entering a row with an existing character value ("AllSports") in the dataset with this code:
rbind(share, c("AllSports", 32.7))
--> it added the row perfectly
I am wondering whether I need to tell R that there is a new character value in the "Format" column before I bind the new row to the data frame?

Your Format column is a factor variable. Look at str(share), str(share$Format), class(share$Format) and levels(share$Format) for more information. rbind(share, c("AllSports", 32.7)) worked because "AllSports" is already an existing factor level of the Format variable.
To fix the issue, convert the Format column to character:
share$Format <- as.character(share$Format)
Do some searches on factor variables and setting factor levels to learn more. Also, when you read the file from csv, you can prevent character strings from being converted to factors with the option stringsAsFactors = FALSE, for example share <- read.csv("myfile.csv", stringsAsFactors = FALSE).
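To make the fix above concrete, here is a minimal sketch. The three-row data frame is a shortened stand-in for the asker's data, and new_row is an illustrative name that does not appear in the question. It binds a one-row data frame instead of a character vector, because c("Others", 32.7) turns 32.7 into the string "32.7".

```r
# Toy version of the 'share' data frame; stringsAsFactors = TRUE mimics
# a read.csv() call that turned Format into a factor.
share <- data.frame(
  Format   = c("NewsTalk", "Country", "AllSports"),
  PctShare = c(12.6, 12.5, 2.5),
  stringsAsFactors = TRUE
)

# Convert the factor to character so that new values are accepted.
share$Format <- as.character(share$Format)

# Bind a one-row data frame rather than c("Others", 32.7), so that
# PctShare stays numeric.
new_row <- data.frame(Format = "Others", PctShare = 32.7,
                      stringsAsFactors = FALSE)
totalshare <- rbind(share, new_row)

str(totalshare)
```

Binding a data frame row keeps each column's type intact, which matters if PctShare is summed or plotted later.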

Two solutions I have in mind.
Solution 1:
Before reading the data, set
options(stringsAsFactors = FALSE)
or
Solution 2:
Convert the column to character, as suggested by @JasonAizkalns.
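As a rough illustration of reading the data without factors in the first place, the sketch below passes stringsAsFactors = FALSE directly to read.csv(). The inline csv_text is a made-up stand-in for the asker's file. Note that from R 4.0.0 onward FALSE is already the default, so the argument mainly matters on older versions.

```r
# Inline stand-in for the csv file (hypothetical contents).
csv_text <- "Format,PctShare
NewsTalk,12.6
Country,12.5
AllSports,2.5"

# Read strings as character, not factor.
share <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(share$Format)

# A row with a previously unseen Format value can now be appended.
totalshare <- rbind(
  share,
  data.frame(Format = "Others", PctShare = 32.7, stringsAsFactors = FALSE)
)
tail(totalshare, 1)
```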

Related

read_delim( ) from tidyverse cannot directly correct misaligned headers of text file as basic read.table() does

I am trying to use tidyverse read_delim() to read a tab-separated text file.
I can use base R's read.table() with no problem, but when I tested read_delim() with delim = "\t" I ran into a problem. For example, I have the file below, "test.txt". As you can see, the header is shifted to the right because the first column holds row names and has no header.
T1 T2 T3
A 1 4 7
B 2 5 8
C 3 6 9
I can use basic R to read this file successfully:
dat <- read.table("test.txt", header=T, sep="\t")
dat
T1 T2 T3
A 1 4 7
B 2 5 8
C 3 6 9
But when I tried to use tidyverse read_delim, I got problems:
dat1 <- read_delim("test.txt", delim ="\t")
Rows: 3 Columns: 3
── Column specification ──────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): T1, T3
dbl (1): T2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
I know basic R's read.table() can automatically correct this problem, but could someone tell me if tidyverse read_delim() has a way to resolve this issue?
Thank you!
-Xiaokuan
The issue isn’t exactly that the headers are misaligned - it’s that readr doesn’t support or recognize row names at all.* readr::read_delim() therefore doesn’t account for the fact that row names don’t have a column header, and just sees three column names followed by four columns of data.
If your goal is to import your data as a tibble, your best bet is probably to use base::read.table(), then tibble::as_tibble(), using the rownames arg to convert the row names to a regular column.
library(tibble)
dat <- read.table("test.txt", header=T, sep="\t")
as_tibble(dat, rownames = "row")
# A tibble: 3 × 4
row T1 T2 T3
<chr> <dbl> <dbl> <dbl>
1 A 1 4 7
2 B 2 5 8
3 C 3 6 9
Another option would be to manually edit your input file to include a column header above the row names.
*This isn’t an oversight, by the way — it’s an intentional choice by the tidyverse team, as they believe row names to be bad practice. e.g., from the tibble docs: “Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.” Also see this interesting discussion from the tibble github.
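For readers who prefer to stay in base R, the sketch below shows the same idea without tibble. It relies on the documented read.table() behaviour that, when the header has one fewer field than the data rows, the first column is used as row names. The temporary file and the names path and dat_flat are illustrative only.

```r
# Recreate the example file: three header fields, four data fields per row.
path <- tempfile(fileext = ".txt")
writeLines(c("T1\tT2\tT3",
             "A\t1\t4\t7",
             "B\t2\t5\t8",
             "C\t3\t6\t9"), path)

# read.table() sees the shorter header and uses column 1 as row names.
dat <- read.table(path, header = TRUE, sep = "\t")
rownames(dat)

# Move the row names into an ordinary column and reset the row names.
dat_flat <- data.frame(row = rownames(dat), dat,
                       row.names = NULL, stringsAsFactors = FALSE)
dat_flat
```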

Remove index column in read.csv

Inspired by "Prevent row names to be written to file when using write.csv", I am curious whether there is a way to ignore the index column in R using the read.csv() function. I want to import a text file into an RMarkdown document and don't want the row numbers to show in the HTML file produced by RMarkdown.
Running the following code
write.csv(head(cars), "cars.csv", row.names=FALSE)
produces a CSV that looks like this:
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
But, if you read this index-less file back into R (ie, read.csv("cars.csv")), the index column returns:
. speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
I was hoping the solution would be as easy as including row.names=FALSE to the read.csv() statement, as is done with write.csv(), however after I run read.csv("cars.csv", row.names=FALSE), R gets sassy and returns an "invalid 'row.names' specification" error message.
I tried read.csv("cars.csv")[-1], but that just dropped the speed column, not the index column.
How do I prevent the row index from being imported?
If you save your object, you won't have row names.
x <- read.csv("cars.csv")
But if you print it (to HTML), R uses the print.data.frame function, which shows row numbers by default. When I used the following as the last line of my markdown chunk, no row numbers were displayed:
print(read.csv("cars.csv"), row.names = FALSE)
Why?: This problem seems associated with a previous subset procedure that created the data. I have a file that keeps coming back with a pesky index column as I round-trip the data via read/write.csv.
Bottom Line: read.csv takes a file completely and outputs a dataframe, but the file has to be read before any other operation, like dropping a column, is possible.
Easy Workaround: Fortunately it's very simple to drop the column from the new dataframe:
df <- read.csv("data.csv")
df <- df[,-1]
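The sketch below walks through the round trip described in these answers, using a temporary file in place of cars.csv. The object names with_index, as_rownames and clean are illustrative. It shows where the extra column comes from and two ways to avoid carrying it around, as an alternative to dropping it with df[,-1].

```r
path <- tempfile(fileext = ".csv")

# By default write.csv() stores row names as an unnamed first column ...
write.csv(head(cars), path)
with_index <- read.csv(path)
names(with_index)   # the unnamed column is read back as "X"

# ... which read.csv() can consume as row names instead of as data.
as_rownames <- read.csv(path, row.names = 1)
names(as_rownames)

# Writing with row.names = FALSE avoids the extra column altogether.
write.csv(head(cars), path, row.names = FALSE)
clean <- read.csv(path)

# Row numbers shown when printing are not a column; suppress them here.
print(clean, row.names = FALSE)
```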

unexpected numeric constant in R after using cast ()

I tried to reshape a data frame so that the entries in one column become the column names. I used cast(), but I got the following error when I tried to retrieve data from the new data frame.
Here is original data frame:
ID Type rating
1 1 3.5
1 2 4.0
2 2 2.5
And the code:
r_mat <-cast(r_data,ID~type)
r_mat$1
unexpected numeric constant in r_mat$1
Here is what the new data frame looks like:
ID 1 2
1 3.5 4.0
2 NA 2.5
Can anyone kindly help me with this error?
Thanks!
You can use make.names in {base} to "Make syntactically valid names out of character vectors" as follows:
colnames(r_mat) <- make.names(colnames(r_mat), unique = TRUE)
For a set of columns with numeric names, this will insert an "X" character in front of each number, e.g. X1,X2...
For details on the function specification, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/make.names.html
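A small sketch of both ways to reach such a column: quoting the name with backticks, or renaming with make.names(). The data frame is built by hand here to mimic the reshaped result, since the asker's cast() call and package are not reproduced; before_rename is an illustrative name.

```r
# Hand-built stand-in for the reshaped data; check.names = FALSE keeps
# the non-syntactic column names "1" and "2".
r_mat <- data.frame(ID = c(1, 2),
                    `1` = c(3.5, NA),
                    `2` = c(4.0, 2.5),
                    check.names = FALSE)

# Option 1: quote the non-syntactic name with backticks.
before_rename <- r_mat$`1`

# Option 2: make the names syntactically valid, then use $ as usual.
colnames(r_mat) <- make.names(colnames(r_mat), unique = TRUE)
colnames(r_mat)
r_mat$X1
```

Backticks leave the data untouched, while renaming is more convenient if the columns are used in formulas or many later calls.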

Set first column as rowname, in spite of duplicates

sample
Symobls IDs Value1 Value2 Value3
1 NA NA 3.1 2.3 1.7
2 TP53 1234 5.8 6.9 10.1
3 Kras 5678 0.1 0.3 0.5
4 NA NA 10.3 2.1 7.9
5 Hras 9991 20.0 30.0 40.0
6 TP53 1234 -3.1 0.2 1.7
My table looks like this one.
I need to calculate values by row instead of by column.
So I tried to use Symbols as the new row names. That way, I could select a whole row with sample["Hras", ].
When I tried to do this, I ran into this problem:
rownames(sample)<-sample[,1]
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A1CF’, ‘A2M’, ‘A2ML1’, ‘AAGAB’, ‘AAK1’, ‘AAMDC’, ‘AARS2’, ‘AASDH’, ‘AASDHPPT’, ‘AASS’, ‘ABAT’, ‘ABCA1’, ‘ABCA13’, ‘ABCA2’, ‘ABCA4’, ‘ABCA5’, ‘ABCA8’, ‘ABCA9’, ‘ABCB1’, ‘ABCB11’, ‘ABCB4’, ‘ABCB5’, ‘ABCB6’, ‘ABCB8’, ‘ABCB9’, ‘ABCC1’, ‘ABCC10’, ‘ABCC11’, ‘ABCC12’, ‘ABCC13’, ‘ABCC3’, ‘ABCC4’, ‘ABCC5’, ‘ABCC6’, ‘ABCC8’, ‘ABCC9’, ‘ABCD3’, ‘ABCD4’, ‘ABCE1’, ‘ABCF2’, ‘ABCG1’, ‘ABHD1’, ‘ABHD10’, ‘ABHD11’, ‘ABHD12’, ‘ABHD13’, ‘ABHD17B’, ‘ABHD2’, ‘ABHD5’, ‘ABHD6’, ‘ABI1’, ‘ABI2’, ‘ABI3BP’, ‘ABL2’, ‘ABLIM1’, ‘ABLIM2’, ‘ABO’, ‘ABR’, ‘ABRA’, ‘ABTB1’, ‘ABTB2’, ‘ACAA1’, ‘ACAA2’, ‘ACACA’, ‘ACACB’, ‘ACAD10’, ‘ACADL’, ‘ACADSB’, ‘ACAN’, ‘ACAP1’, ‘ACAP2’, ‘ACAP3’, ‘ACAT1’, [... truncated]
Is this because of the "NA"? Other options?
Thanks
This is a microarray dataset. I have done the normalization and am going to extract the values of several genes for plotting, cross-correlation and t-tests. In fact, not only NA but also several of the genes I want to plot have multiple rows, so I need to extract them into another table for later use.
Here, I am just answering a way to change the row.names as you requested in the question. The ultimate goal is not clear. For the specified problem, you could try using make.names with option unique=TRUE. This will make sure that duplicates are named differently. In the first column, there are NA values, which will be named as NA., NA..1 etc.. (if that is okay for you).
row.names(sample) <- make.names(sample[,1],TRUE)
Or, as commented by @Richard Scriven,
row.names(sample) <- paste(make.unique(sample[,1]))
Another option would be to convert data.frame to matrix (which will permit duplicate values). I would recommend this only if the columns are of the same class. For example, if you have character and numeric columns, this will convert all the columns to character class. In your dataset, it seems to me that except the first column, all others are numeric (with the possible exception of "IDs" column). But again the NA values would be a problem. If you want to subset the '1st' or '3rd' row based on the rownames, it will be difficult.
sample1 <- as.matrix(sample[,-1])
row.names(sample1) <- sample[,1]
sample1['Hras',]
# IDs Value1 Value2 Value3
# 9991 20 30 40
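The sketch below rebuilds the example table by hand to try the make.names() suggestion, and also shows one alternative that does not appear in the answer: selecting rows by the value in the Symbols column, which sidesteps unique row names entirely when a gene has several rows. The name sample_df is used instead of sample only to avoid masking base::sample(); hras and tp53_rows are illustrative.

```r
# Hand-built copy of the example table from the question.
sample_df <- data.frame(
  Symbols = c(NA, "TP53", "Kras", NA, "Hras", "TP53"),
  IDs     = c(NA, 1234, 5678, NA, 9991, 1234),
  Value1  = c(3.1, 5.8, 0.1, 10.3, 20.0, -3.1),
  Value2  = c(2.3, 6.9, 0.3, 2.1, 30.0, 0.2),
  Value3  = c(1.7, 10.1, 0.5, 7.9, 40.0, 1.7),
  stringsAsFactors = FALSE
)

# Unique, syntactically valid row names; duplicates get a suffix.
rownames(sample_df) <- make.names(sample_df$Symbols, unique = TRUE)
rownames(sample_df)

# A gene that occurs once can now be selected by row name.
hras <- sample_df["Hras", ]

# For genes with several rows, filter on the column instead;
# which() drops the NA comparisons.
tp53_rows <- sample_df[which(sample_df$Symbols == "TP53"), ]
tp53_rows
```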

obtain value from string in column headers in R

I have a text file that looks like the following
DateTime height0.1 height0.2
2009-01-01 00:00 1 1
2009-01-02 00:00 2 4
2009-01-03 00:00 10 1
Obviously this is just an example; the actual file contains a lot more data, about 100 columns, and the header values can be decimals. I can read the file into R with the following:
dat <- read.table(file,header = TRUE, sep = "\t")
where file is the path of the table. This creates a data.frame in the workspace called dat. I would now like to generate a variable from this data.frame called 'vars' which is an array made up of the numbers in the column headers (except from DateTime which is the first column).
For example, here I would have vars = 0.1, 0.2
Basically I want to take the number that is in the string of the header and then store this in a separate variable. I realize that this will be extremely easy for some, but any advice would be great.
If all the numbers are at the end of the names (that is, not like h984mm19), you can remove everything except digits and punctuation using gsub and convert the result to a numeric vector as follows:
# just give all names except the first column
my_var <- as.numeric(gsub("[^0-9[:punct:]]", "", names(dat)[-1]))
# [1] 0.1 0.2
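As a hedged variation on the gsub approach, the sketch below strips only the leading non-digit prefix. This may be safer if a header contains other punctuation such as an underscore, which the [:punct:] class would otherwise keep. The col_names vector stands in for names(dat) from the question.

```r
# Stand-in for names(dat); the first entry is the DateTime column.
col_names <- c("DateTime", "height0.1", "height0.2")

# Drop the first name, strip everything before the first digit,
# then convert what remains to numeric.
vars <- as.numeric(sub("^[^0-9]*", "", col_names[-1]))
vars
```

This variant assumes the number is the last thing in each header; a name with trailing text after the number would still need a more specific pattern.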
