I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to factor. However when I do the following, it fails :
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds :
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply since I want to programmatically convert the columns into factors and it could be possible that the number of columns to be converted is only 1(it could be more as well, this number is driven from arguments to a function). Any way I can do it using lapply only?
When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] turns a single column into a vector and therefore each element in the vector is factored individually. Whereas leaving it out keeps the column as a list, which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
On my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
Related
HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(d[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
I am working with R. I found this link here on creating empty data frames in R: Create an empty data.frame .
I tried to do something similar:
df <- data.frame(Date=as.Date(character()),
country=factor(),
total=numeric(),
stringsAsFactors=FALSE)
Yet, when I try to populate it:
df$total = 7
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, total, value = 7) :
replacement has 1 row, data has 0
df[1, "total"] <- rnorm(100,100,100)
Error in `[<-.data.frame`(`*tmp*`, 1, "total", value = c(-79.4584309347689, :
replacement has 100 rows, data has 1
Does anyone know how to fix this error?
Thanks
An option is to specify the row index
df[1, "total"] <- 7
-output
str(df)
#'data.frame': 1 obs. of 3 variables:
# $ Date : Date, format: NA
# $ country: Factor w/ 0 levels: NA
# $ total : num 7
The issue is that when we select a single column and assign on a 0 row dataset, it is not automatically expanding the row for other columns. By specifying the row index, other columns will automatically filled with default NA
Regarding the second question (updated), a standard data.frame column is a vector and the length of the vector should be the same as the index we are specifying. Suppose, we want to expand to 100 rows, change the index accordingly
df[1:100, "total"] <- rnorm(100, 100, 100) # length is 100 here
dim(df)
#[1] 100 3
Or if we need to cram everything in a single row, then wrap the rnorm in a list
df[1, "total"] <- list(rnorm(100, 100, 100))
In short, the lhs should be of the same length as the rhs. Another case is when we are assigning from a different dataset
df[seq_along(aa$bb), "total"] <- aa$bb
This can also be done without initialization i.e.
df <- data.frame(total = aa$bb)
I've figured out if I use as.character(df[x,y]) or as.<whatever>df[x,y] I can get/coerce what I need, every time from my data frames
What I cant seem to find/figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
data.frame default is to convert character vectors to factors. You can change this with the argument stringsAsFactors=FALSE
Also, when you subset a dataframe using [, you can add the drop=FALSE argument to simplify the results in some cases.
the Excel csv data file (called ff) has 54 columns & 788 rows of normalized data between 0 & 1, that looks like this: 0.39 0.16 0.27 0.60 ...
> str(ff)
'data.frame': 788 obs. of 54 variables:
$ V1 : Factor w/ 66 levels " - "," 0.05 ",..: 25 36 33 44 36 37 39 20
> dd <- as.numeric(as.character(ff))
Warning message:
NAs introduced by coercion
> dd <- gsub(".","",ff)
> de <- as.numeric(as.character(dd))
> str(de)
num [1:54] NA NA NA NA NA NA NA NA NA NA ...
I'm at a loss. I saw that lots of folks (perhaps beginners like me) have posted somewhat similar questions, please accept my apologies for raising this matter again. My gratitude in advance for your suggestions.
I think one problem you're having is that you're running the as.numeric(as.character(.)) call on an entire data frame, rather than a particular column. The result is a vector, the length of which equals the number of columns in your data frame (note your output is a vector of length 54, rather than 788 like you'd be hoping for from a column of your original data frame). Here's why:
When you convert a data frame to character, you get a vector back:
df <- data.frame( V1 = c(1,2,3), V2 = c(4,5,6) )
as.character( df )
[1] "c(1, 2, 3)" "c(4, 5, 6)"
Note that each vector element is not a character vector (ie: c("1","2","3")), but is in fact the vector representing that column, converted to a character string (ie: "c(1, 2, 3)"). So when you apply as.numeric to that vector, you'll get a vector back (not a data frame), and since each element can't be converted to a number (or even a numeric vector), you get NAs back:
as.numeric( as.character( df ) )
[1] NA NA
What you're more likely looking for is a conversion for a single column, rather than the entire data frame. Try:
ff$V1 <- as.numeric( as.character( ff$V1 ) )
This way you're converting a vector to a vector, which should give you the result you're after. You can do this over every column using lapply, something like:
df <- lapply( df, function(x) as.numeric( as.character( x ) ) )
df <- as.data.frame( df )
(or better yet, set the colClasses when you read the file as per #s.brunel's comment, such that you don't need to worry about this conversion at all)
NOTE also #akrun's comment. You should expect a warning when converting a vector where some values can't be converted to the class you're wanting. In your case, you've got some " - " values, which can't be converted to numeric, so you'll get NAs in place of those.
I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.
The rbind help pages specifies that :
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
was almost surrendering to this issue.
1) create data frame with stringsAsFactor set to FALSE or you run straight into the next issue
2) don't use rbind - no idea why on earth it is messing up the column names. simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warnmeldungen:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)
FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
> X1 X2
> 1 1 2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"
You can do this:
give one row to the initial data frame
df=data.frame(matrix(nrow=1,ncol=length(newrow))
add your new row and take out the NAS
newdf=na.omit(rbind(newrow,df))
but watch out that your newrow does not have NAs or it will be erased too.
Cheers
Agus
I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg
Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND coverts all variables to factors
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95
Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.