how to convert large column using factor [duplicate]

This question already has answers here:
Convert data.frame column format from character to factor
(8 answers)
Closed 3 years ago.
I'm writing machine learning code for a dataset that has a hotel column. The column contains 300 hotel names. For data preprocessing, I read that the column has to be converted to a factor. Is there an easy way to convert it, given that there are so many values for the levels?

It's simple: use the as.factor() function to convert the column from character to factor.
Here's a sample
# Sample data
data
a b
1 A 1
2 B 2
3 C 3
4 A 4
5 B 5
class(data$a)
[1] "character"
# Converting to factor
data$a <- as.factor(data$a)
# Results
class(data$a)
[1] "factor"
summary(data$a)
A B C
2 2 1

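If you load the CSV data into a data frame with read.csv() in a version of R before 4.0.0, columns containing string values are loaded as factor columns by default (stringsAsFactors = TRUE). From R 4.0.0 onward the default is stringsAsFactors = FALSE, so those columns are read as character.
Either way, you can use the factor() function to convert a column to a factor:
df$a <- factor(df$a)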
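As a rough sketch of how this plays out with many levels, the example below builds a made-up data frame (the names `hotels`, `hotel` and `city` are assumptions, not from the question) and converts its character columns. The levels are derived from the unique values, so there is no need to list the 300 names by hand.

```r
# Hypothetical data: 300 distinct hotel names, each appearing three times
hotels <- data.frame(
  hotel = rep(paste("Hotel", 1:300), times = 3),
  city  = rep(c("Paris", "Rome", "Oslo"), each = 300),
  stringsAsFactors = FALSE
)

# Convert one column; the levels are taken from the unique values
hotels$hotel <- factor(hotels$hotel)
nlevels(hotels$hotel)   # 300

# Convert every remaining character column in one go
chr_cols <- vapply(hotels, is.character, logical(1))
hotels[chr_cols] <- lapply(hotels[chr_cols], factor)
str(hotels)
```

Whether a 300-level factor is a good input for a given model is a separate question; this only shows the conversion itself.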

Related

Viewing single column of data frame in R [duplicate]

This question already has answers here:
How to subset matrix to one column, maintain matrix data type, maintain row/column names?
(1 answer)
How do I extract a single column from a data.frame as a data.frame?
(3 answers)
Closed 5 years ago.
I am running a simulation model that creates a large data frame as its output, with each column corresponding to the time-series of a particular variable:
data5<-as.data.frame(simulation3$baseline)
Occasionally I want to look at subsets, especially particular columns, of this data frame in order to get an idea of the output. For this I am using the View-function like so
View(data5[1:100,1])
for instance, if I wish to see the first 100 rows of column 1. Alternatively, I also sometimes do something like this, using the names of the time series:
timeframe=1:100
toAnalyse=c("u","u_n","u_e","u_nw")
View(data5[timeframe,toAnalyse])
In either case, there is an annoying display problem when I am trying to view a single column on its own (as for instance with View(data5[1:100,1])), whereby what I get looks like this:
Example 1
As you can see, the top of the table which would usually contain the name of the variable in the dataset instead contains a string of all values that the variable takes. This problem does not appear if I select 2 or more columns:
Example 2
Does anyone know how to get rid of this issue? Is there some argument that I can feed to View to make sure that it behaves nicely when I ask it to just show a single column?

View(data5[1:100,1, drop=FALSE])
When you select a single column of a data frame with [, j], the result is simplified to a vector and the column name is lost. drop=FALSE prevents that, so the result stays a data frame and keeps the column name.
For instance:
> df
n s b
1 2 aa TRUE
2 3 bb TRUE
3 5 cc TRUE
> df[, 1]
[1] 2 3 5
> df[, 1, drop=FALSE]
n
1 2
2 3
3 5
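A small sketch of how to check what each form of indexing returns, using the same toy data frame as above. The list-style `df[1]` form is an addition here, not part of the original answer; it also keeps the data frame structure.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = TRUE)

is.data.frame(df[, 1])                 # FALSE: simplified to a vector
is.data.frame(df[, 1, drop = FALSE])   # TRUE: still a data frame
is.data.frame(df[1])                   # TRUE: list-style indexing never drops

# Row and column selection together, by column name
one_col <- df[1:2, "n", drop = FALSE]
names(one_col)                         # "n"
```

Any of the data frame results can then be passed to View() and should show the column name in the header.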

Doing something similar to melt to an R dataframe [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I've got a dataframe like this:
The first column is numeric, and the second column is a comma separated list (character)
id numbers
1 2,4,5
2 1,4,6
3 NA
4 NA
5 5,1,2
And I want to in essence "melt" the dataframe similar to the reshape package. So that the output is a dataframe which looks like this
id numbers
1 2
1 4
1 5
2 1
2 4
2 6
3 NA
4 NA
5 5
5 1
5 2
However, with the reshape2 package each number would have to be in its own column, which takes up too much storage space if there are many numbers. That is why I have opted to store the numbers as a comma-separated list, but melt no longer works with this setup.
Can you recommend the most efficient way to achieve the transformation from the input dataframe to output dataframe?

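The way I would do it: for each row, create a data.frame and store it in a list, where df is your initial data.frame. Note that strsplit() (not split()) is the function that splits a string on a separator; it returns a list, hence the [[1]].
l = list()
for (j in seq_len(nrow(df))) {
  # one small data.frame per row: the id is recycled across the split numbers
  l[[j]] = data.frame(id = df$id[[j]],
                      numbers = strsplit(as.character(df$numbers[[j]]), ',')[[1]])
}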
Afterwards, you can stack all list elements into a single data.frame using plyr::ldply with the 'data.frame' option.
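For comparison, here is a vectorised base R sketch of the same transformation that avoids the per-row loop. The input is rebuilt from the question's example, and the names `pieces` and `long` are only illustrative.

```r
# Example input from the question
df <- data.frame(id = 1:5,
                 numbers = c("2,4,5", "1,4,6", NA, NA, "5,1,2"),
                 stringsAsFactors = FALSE)

# Split every string at once; NA entries stay as a single NA
pieces <- strsplit(df$numbers, ",", fixed = TRUE)

# Repeat each id once per piece and flatten the pieces
long <- data.frame(
  id      = rep(df$id, times = lengths(pieces)),
  numbers = as.integer(unlist(pieces))
)
long
```

If you keep the loop instead, do.call(rbind, l) stacks the list in base R without needing plyr.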

r how to subset without retaining all data info from original set? [duplicate]

This question already has answers here:
Drop unused factor levels in a subsetted data frame
(16 answers)
Closed 7 years ago.
I am trying to subset data.
here's the link to sample data to play around with:
https://drive.google.com/file/d/0BwIbultIWxeVOFdRaE81Nm9qc2s/view?usp=sharing
so in this data set, the last column has name "Type", which has 2 values: "normal." and "back."
and let's say i am subsetting based on the "Type" column:
test.data = read.csv(file = paste0(dd, '/data_example.csv'))
test.subdata1 = subset(test.data, test.data$Type == 'normal.')
test.subdata2 = test.data[test.data$Type == 'normal.',]
here, I'm subsetting using two most common methods:
by using subset()
by directly filtering in the []
supposedly, the new subsetted data should only contain data that has Type "normal." (there's a period after the word)
and indeed, when i view the subset data table, there's only "normal." ones present.
HOWEVER, the thing is, the "back." class info is retained in my subsetted data, as shown in following output:
str(test.subdata1$Type)
# Factor w/ 2 levels "back.","normal.": 2 2 2 2 2 2 2 2 2 2 ...
str(test.subdata2$Type)
# Factor w/ 2 levels "back.","normal.": 2 2 2 2 2 2 2 2 2 2 ...
so it does not matter which subsetting method i use, the complete information from the original data set will be retained in my subset data set.
my question is:
HOW to get rid of the extra info from the original data set i do not want to retain in my subset data set?
meaning, how can i see only 1 factor level in my subset data and not 2 factor levels?

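# Is this what you need? Rebuilding the factor from its integer codes drops
# the unused level, but the remaining level is then labelled "2", not "normal."
test.subdata1$Type = as.factor(as.integer(test.subdata1$Type))
# or maybe (usually preferable): re-create the factor, which drops the unused
# level "back." and keeps the label "normal."
test.subdata1$Type = factor(test.subdata1$Type)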
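Base R also provides droplevels(), which drops unused levels from every factor column of a data frame in one call. A minimal sketch on made-up data (the linked file is not reproduced here, so `test.data` below is a stand-in with hypothetical values):

```r
# Stand-in for the question's data: a factor column with two levels
test.data <- data.frame(
  x    = 1:4,
  Type = factor(c("normal.", "back.", "normal.", "normal."))
)

sub <- test.data[test.data$Type == "normal.", ]
levels(sub$Type)     # "back."   "normal."  (unused level still present)

sub <- droplevels(sub)
levels(sub$Type)     # "normal."
```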

filter R data frame with one column - keep data frame format [duplicate]

This question already has an answer here:
Filtering single-column data frames
(1 answer)
Closed 7 years ago.
I am looking for a simple way to display a subset of a one-column data frame.
Let's assume I have a data frame:
> df <- data.frame(a = 1:100)
Now, I only need the first 10 rows. If I subset it by index, I'll get a result vector instead of a data frame:
> df[1:10,]
[1] 1 2 3 4 5 6 7 8 9 10
I tried to use 'subset' but not using the 'subset'-parameter will result in an error (only for one-column-data-frames?):
subset(df[1:10,])
Error in subset.default(df[1:10, ]) :
argument "subset" is missing, with no default
There should be a very easy way to get a subset (still a data frame) filtered by row index, no?
I am looking for a solution with basic R commands (it should not depend on any special library).

You can use drop=FALSE, which prevents the dimensions from being dropped, so the result stays a data frame instead of being simplified to a vector.
df[1:10, , drop=FALSE]
a
1 1
2 2
3 3
4 4
5 5
...
For subset() you need to supply a condition as the subset argument.
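A short sketch of three base R ways to get the first 10 rows while keeping the data frame structure. The head() variant and the row-number condition passed to subset() are additions for illustration, not part of the original answer.

```r
df <- data.frame(a = 1:100)

via_index  <- df[1:10, , drop = FALSE]                 # indexing without dropping
via_head   <- head(df, 10)                             # first n rows
via_subset <- subset(df, seq_len(nrow(df)) <= 10)      # condition on the row number

is.data.frame(via_index)    # TRUE
is.data.frame(via_head)     # TRUE
is.data.frame(via_subset)   # TRUE
```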

Convert columns into multiple rows per entry in R [duplicate]

This question already has answers here:
Convert columns to rows keeping the name of the column
(2 answers)
Closed 9 years ago.
I have the following data:
word Jan-2013 Feb-2013 Mar-2013
A 1 2 3
B 5 2 4
I want to convert the multiple date columns into one, named date and add an additional column for the value.
word date value
A Jan-2013 1
A Feb-2013 2
A Mar-2013 3
B Jan-2013 5
B Feb-2013 2
B Mar-2013 4
Can anyone assist?
Thanks

Additional R options
In addition to Metrics's answer, here are two additional options for R (assuming your data.frame is called "mydf"):
cbind(mydf[1], stack(mydf[-1]))
library(reshape)
melt(mydf, id.vars="word")
Excel option
I am not an Excel user, but since this question is tagged "Excel" as well, I would suggest the Tableau Reshaper Excel add-on.
For your example, it's pretty straightforward:
Go to the "Tableau" menu after installing the add-on and activating it.
Select the cells which contain the values you want to unstack. Click on OK.
View the result.

Using reshape from base R (df1 is your dataframe)
reshape(df1,times=names(df1)[-1],timevar="date",varying=names(df1)[-1],v.names="value",new.row.names=1:6,ids=NULL,direction="long")
word date value
1 A Jan.2013 1
2 B Jan.2013 5
3 A Feb.2013 2
4 B Feb.2013 2
5 A Mar.2013 3
6 B Mar.2013 4
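To try the base R approaches on the question's data, the sketch below rebuilds the example data frame. It assumes you want the original "Jan-2013" style names, so it passes check.names = FALSE; without it, data.frame() and read.csv() turn the hyphen into a dot, as seen in the output above. The renaming and reordering steps are illustrative additions.

```r
# Example data from the question, keeping the non-syntactic column names
mydf <- data.frame(word = c("A", "B"),
                   "Jan-2013" = c(1, 5),
                   "Feb-2013" = c(2, 2),
                   "Mar-2013" = c(3, 4),
                   check.names = FALSE, stringsAsFactors = FALSE)

# stack() returns columns 'values' and 'ind'; 'word' is recycled by cbind()
long <- cbind(mydf[1], stack(mydf[-1]))
names(long) <- c("word", "value", "date")

# Reorder rows and columns to match the layout asked for in the question
long <- long[order(long$word), c("word", "date", "value")]
long
```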