Related
I've figured out if I use as.character(df[x,y]) or as.<whatever>df[x,y] I can get/coerce what I need, every time from my data frames
What I cant seem to find/figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
data.frame default is to convert character vectors to factors. You can change this with the argument stringsAsFactors=FALSE
Also, when you subset a dataframe using [, you can add the drop=FALSE argument to simplify the results in some cases.
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapplyor lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
I have a character vector in R with 330000 values e.g.
amp184660
amp947
amp53303
amp364886
amp121615
and and a data frame like this:
I want to find each value from my character vector in first column of the data frame i.e. "Assay Name" and then output its corresponding chromosome position i.e "Chrom" into a new vector. I want to do this as quickly as possible as there are about 330k entries and doing this via grep over a loop will take about 12 hours to finish.
Any ideas?
Thanks
Jason.
I would suggest %in%, which is likely to be faster than merge. Here's a toy example:
## Assume that "x" is your data.frame
set.seed(1)
x <- data.frame(Assay = sample(letters, 30, replace = TRUE),
Chrom = 4, ChromPos = rnorm(30))
## And that "y" is your vector you want to match
y <- c("a", "b", "c", "d", "e")
## Here's how you can use %in%
x[x$Assay %in% y, ]
# Assay Chrom ChromPos
# 10 b 4 0.6198257
# 12 e 4 -0.1557955
# 24 d 4 1.1000254
# 27 a 4 -0.2533617
## And can also directly extract a specific column
x[x$Assay %in% y, "ChromPos"]
# [1] 0.6198257 -0.1557955 1.1000254 -0.2533617
# assume your df called your_data_frame and vector called your_character_vector
vector_frame<-data.frame("Assay Name"=your_character_vector)
merge(vector_frame,your_data_frame,by="Assay Name")[,3]
note I changed the column notation from $Chrom to [,3] because I saw you wanted the third column and R will rename the column in the $ call e.g. to Chrom.Pos..bp. or something similar - if you type the $ and press TAB in the RStudio editor it'll give you the options
Just in case runtime is still a problem, using the data.table package is approx. 100x faster than merge and 50x faster than %in%:
library(data.table)
dt <- as.data.table( yourDataFrame )
setkey( dt, Assay )
dt[ J(yourVector) ]
Is there a way to assign a value to a specific column within a data frame? e.g.,
dat2 = data.frame(c1 = 101:149, VAR1 = 151:200)
j = "dat2[,"VAR1"]" ## or, j = "dat2[,2]"
assign(j,1:50)
The approach above doesn't work. Neither does this:
j = "dat2"
assign(get(j)[,"VAR1"],1:50)
lets assume that we have a valid data.frame with 50 rows in each
dat2 <- data.frame(c1 = 1:50, VAR1 = 51:100)
1 . Don't use assign and get if you can avoid it.
"dat2[,"VAR1"]" is not valid in R.
You can also note this from the help page for assign
assign does not dispatch assignment methods, so it cannot be used to
set elements of vectors, names, attributes, etc.
Note that assignment to an attached list or data frame changes the
attached copy and not the original object: see attach and with.
A column of a data.frame is an element of a list
What you are looking for is [[<-
# assign the values from column (named element of the list) `VAR1`
j <- dat2[['VAR1']]
If you want to assign new values to VAR1 within dat2,
dat2[['VAR1']] <- 1:50
The answer to your question....
To manipulate entirely using character strings using get and assign
assign('dat2', `[[<-`(get('dat2'), 'VAR1', value = 2:51))
Other approaches
data.table::set
if you want to assign by reference within a data.frame or data.table (replacing an existing column only) then set from the data.table package works (even with data.frames)
library(data.table)
set(dat2, j = 'VAR1', value = 5:54)
eval and bquote
dat1 <- data.frame(x=1:5)
dat2 <- data.frame(x=2:6)
for(x in sapply(c('dat1','dat2'),as.name)) {
eval(bquote(.(x)[['VAR1']] <- 2:6))
}
eapply
Or if you use a separate environment
ee <- new.env()
ee$dat1 <- dat1
ee$dat2 <- dat2
# eapply returns a list, so use list2env to assign back to ee
list2env(eapply(ee, `[[<-`, 'y', value =1:5), envir = ee)
set2 <- function(x, val) {
eval.parent(substitute(x <- val))
}
> dat2 = data.frame(c1 = 101:150, VAR1 = 151:200)
> set2(dat2[["VAR1"]], 1:50)
> str(dat2)
'data.frame': 50 obs. of 2 variables:
$ c1 : int 101 102 103 104 105 106 107 108 109 110 ...
$ VAR1: int 1 2 3 4 5 6 7 8 9 10 ...
I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.
The rbind help pages specifies that :
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
was almost surrendering to this issue.
1) create data frame with stringsAsFactor set to FALSE or you run straight into the next issue
2) don't use rbind - no idea why on earth it is messing up the column names. simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warnmeldungen:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)
FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
> X1 X2
> 1 1 2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"
You can do this:
give one row to the initial data frame
df=data.frame(matrix(nrow=1,ncol=length(newrow))
add your new row and take out the NAS
newdf=na.omit(rbind(newrow,df))
but watch out that your newrow does not have NAs or it will be erased too.
Cheers
Agus
I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg
Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND coverts all variables to factors
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95
Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.