I have a vector of character data with repeated values. My ultimate goal is to create a bar plot displaying how often each unique value occurs in the vector. A long way of doing it would be as follows:
object1=length(df$vector[df$vector=="object1"])
object2=length(df$vector[df$vector=="object2"])
object3=length(df$vector[df$vector=="object3"])
amounts=c(object1,object2, object3)
barplot(amounts)
This works but is cumbersome when there are many unique values, which indicates to me that a loop could be used. I know I can get a vector of the unique values in the original vector via the "unique()" command, but I'm not sure where to go from there. The following posts have made me think, but weren't able to answer my question.
Counting the number of elements with the values of x in a vector
R for loop on character variables
You could use ggplot2.
Installation:
install.packages('ggplot2')
Load the library:
library(ggplot2)
Plot the bar plot:
ggplot(df,aes(x=as.factor(vector)))+geom_bar()
If your vector is numeric, as.factor() converts it to a categorical variable so that each distinct value gets its own bar.
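If you would rather stay in base R, the counting step from the question can probably be done in one call with table(). This is only a minimal sketch: the values vector below is made-up data standing in for df$vector.

```r
# Hypothetical example data standing in for df$vector
values <- c("object1", "object2", "object1", "object3", "object1", "object2")

# table() counts how often each unique value occurs, in one step
amounts <- table(values)

# barplot() accepts the table directly and labels each bar with its value
barplot(amounts)
```

No loop over unique() is needed here, because table() already does the per-value counting.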
Good day,
There is something here I don't understand: it works, but I can't see why.
I have this data frame:
# planets_df is pre-loaded in your workspace
# Use order() to create positions
positions <- order(planets_df$diameter)
positions
# Use positions to sort planets_df
planets_df[positions,]
I don't understand why, if you take the diameter column and want to order by it, you put the result in the row position of the data frame. To me the form is [rows, columns], yet something derived from a column goes in the row slot and the order changes. I really don't get that. Why is it not planets_df[,positions]?
The exercise is solved; I just don't understand it. It is a DataCamp exercise, by the way.
Sorry if my English is wrong; it is not my native language.
I believe that I have created an example that matches your description. For the mtcars data set, which is pre-loaded in any R session, we can sort based on the variable mpg.
The function order() returns the row indices that would sort mpg in ascending order. The ordering variable stores those indices, so it records the order in which the rows should be presented.
ordering <- order(mtcars$mpg)
This next step indicates that we want the rows of mtcars as specified by ordering. Essentially, ordering is the order of the rows we want, so we pass that object to the row portion of the call to mtcars.
mtcars[ordering,]
If we instead passed ordering as the columns, we would be reordering the columns of mtcars instead of the rows.
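A tiny made-up example may make the row/column distinction easier to see. The planets data frame below is hypothetical (it is not the DataCamp planets_df), and the values are only for illustration.

```r
# Hypothetical small data frame for illustration
planets <- data.frame(name = c("Earth", "Mercury", "Mars"),
                      diameter = c(1.000, 0.382, 0.532))

# order() returns the row numbers that would sort diameter ascending
positions <- order(planets$diameter)
positions  # 2 3 1

# Row slot: picks rows 2, 3, 1 in that order and keeps every column
sorted_rows <- planets[positions, ]

# Column slot: planets[, positions] would ask for columns 2, 3, 1.
# This data frame has only two columns, so that call would fail with
# "undefined columns selected".
```

The numbers in positions come from a column, but what they describe is rows: "row 2 first, then row 3, then row 1". That is why they belong before the comma.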
In a simulation I produce one very large vector of numbers, which I want to show in a histogram. Unfortunately, my RAM doesn't allow vectors as long as I require (10^10 entries).
Thus, I put my simulation in a loop producing several vectors of shorter length.
I tried the hist function and summing hist$counts, but the binning keeps changing, which makes a summation impossible (for me...).
Now I am looking for a solution that handles these smaller vectors sequentially:
read the first vector (from the loop)
extract the information for a histogram
keep the histogram information of the first vector but discard the vector itself to save memory
do this for all the other vectors and store only the histogram information of each
build one histogram in which the accumulated histogram information is added up into one set of counts.
Can anyone help out? Is this possible in R? I'm stuck... Thanks to all who took the time to read this!
Your problem, if I understand correctly, is that the histogram bins are changing. So the natural solution would be to fix the bins using the breaks parameter of the hist function. For better performance you can set plot = FALSE and just collect the bin counts from each part.
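As a rough sketch of that idea, assuming the simulated values always fall between 0 and 1 (runif() below is just a stand-in for one batch of your simulation, and the break points are made up):

```r
set.seed(1)

# Assumed: all values fall in [0, 1]; the breaks must cover the full data range
breaks <- seq(0, 1, by = 0.1)
total_counts <- numeric(length(breaks) - 1)

for (i in 1:5) {
  chunk <- runif(1000)                             # stand-in for one simulation batch
  h <- hist(chunk, breaks = breaks, plot = FALSE)  # same bins on every pass
  total_counts <- total_counts + h$counts          # keep the counts, drop the chunk
}

# Draw the accumulated histogram from the summed counts
barplot(total_counts, names.arg = head(breaks, -1), space = 0)
```

Only the vector of bin counts survives each iteration, so memory use stays at the size of one chunk. Note that hist() stops with an error if a value lies outside the supplied breaks, so the range has to be chosen generously in advance.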
You can obtain the information a histogram requires with the count() function from the dplyr package.
Let's say the values in your vector range from 1 to 100. First, define your buckets: 1-10, 11-20, ...
Then, within the loop, apply cut() with the breaks argument to each smaller vector to turn the numeric vector into a categorical one, and use count() to count the values in each bucket.
At the end of your loop, combine all the counts you obtained.
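Here is one way the bucket approach might look. To keep the sketch free of extra packages it swaps dplyr's count() for base R's table(), which does the same counting on the factor that cut() returns; sample() is only a stand-in for a simulation batch.

```r
set.seed(2)

# Buckets (0,10], (10,20], ..., (90,100], i.e. 1-10, 11-20, ... for integers
breaks <- seq(0, 100, by = 10)
total <- NULL

for (i in 1:3) {
  chunk <- sample(1:100, 500, replace = TRUE)    # stand-in for one batch
  counts <- table(cut(chunk, breaks = breaks))   # one count per bucket, zeros kept
  total <- if (is.null(total)) counts else total + counts
}

total
```

Because cut() always produces the same factor levels for the same breaks, the tables line up and can simply be added.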
I have a data frame consisting of five character variables which represent specific bacteria. Each variable has thousands of observations, all of which begin with the letter K, e.g.
x <- c("K0001","K0001","K0003","K0006")
y <- c("K0001","K0001","K0002","K0003")
z <- c("K0001","K0002","K0007","K0008")
r <- c("K0001","K0001","K0001","K0001")
o <- c("K0003","K0009","K0009","K0009")
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
I have tried to convert the vectors using as.factor and as.numeric, but neither approach works: the first gives an equivalent error to the one above, and as.numeric returns NAs.
Thanks in advance
The reference that you cited recommended using setdiff. The only thing you need to do to apply that solution is to combine the other four columns into one vector, so that it can be treated as a set. You can do that with unlist:
setdiff(data$x, unlist(data[,2:5]))
"K0006"
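For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the example from the question. The data frame name data and the column names are taken from the question; stringsAsFactors = FALSE is my assumption, so that the columns stay character on older R versions.

```r
# Rebuild the example data from the question as a data frame
data <- data.frame(
  x = c("K0001", "K0001", "K0003", "K0006"),
  y = c("K0001", "K0001", "K0002", "K0003"),
  z = c("K0001", "K0002", "K0007", "K0008"),
  r = c("K0001", "K0001", "K0001", "K0001"),
  o = c("K0003", "K0009", "K0009", "K0009"),
  stringsAsFactors = FALSE
)

# Flatten columns 2 to 5 into one character vector
others <- unlist(data[, 2:5])

# Values of x that never occur in any of the other columns
result <- setdiff(data$x, others)
result
```

data$x is already a plain character vector, so there is no need for select() here.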
I'm trying to get the correlation coefficient for corresponding columns of two CSV files. I simply use the following but get errors. Consider that each CSV file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x=first_values, y=second_values, method="spearman")
But I get the error 'x' must be numeric!
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
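If you also want the Spearman method from the question, it can probably be passed through mapply's MoreArgs argument. This is a sketch on made-up data: the two small data frames below stand in for the CSV files, and all their columns are assumed to be numeric.

```r
set.seed(3)

# Made-up stand-ins for the two CSV files (all columns numeric)
first_values  <- data.frame(a = rnorm(20), b = rnorm(20))
second_values <- data.frame(a = rnorm(20), b = rnorm(20))

# Pair up column 1 with column 1, column 2 with column 2, and so on;
# MoreArgs hands method = "spearman" to every cor() call
correlations <- mapply(cor, first_values, second_values,
                       MoreArgs = list(method = "spearman"))
correlations
```

The result is a named numeric vector with one coefficient per column pair, named after the columns of the first data frame.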
There must be some categorical variable in x. You can first separate that categorical variable from x and then pass the remaining numeric columns to cor().
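One way to do that separation, sketched on a small made-up data frame (the id column plays the role of the categorical variable; none of these names come from the question):

```r
# Made-up data: one character column and two numeric columns
values <- data.frame(id = c("s1", "s2", "s3", "s4"),
                     a  = c(1, 2, 3, 4),
                     b  = c(2, 1, 4, 3),
                     stringsAsFactors = FALSE)

# TRUE for each column that cor() can handle
numeric_cols <- sapply(values, is.numeric)

# Keep only the numeric columns; drop = FALSE keeps a data frame
numeric_only <- values[, numeric_cols, drop = FALSE]

cm <- cor(numeric_only, method = "spearman")
cm
```

Running str() on the data frame returned by read.csv() is a quick way to see which columns were read as character or factor.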
I have a df with over 30 columns and over 200 rows, but for simplicity I will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot, with X1 giving the x-axis values and facets defined by the values in B and C.
plotdata<-function(l){
melt<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
plot<-ggplot(melt,aes(x=X1,y=value))+geom_point()
plot2<-plot+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function requires. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric.
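To see the same pattern without needing ggplot2 installed, here is a small sketch that loops over a character vector of column names. The data frame and the wanted vector are made up, and mean() stands in for the real plotdata() call.

```r
# Made-up data frame with a few Y columns
df <- data.frame(X1 = 1:5,
                 Y1 = c(2, 4, 6, 8, 10),
                 Y2 = 5:1,
                 Y3 = c(1, 1, 2, 3, 5))

# The user-specified columns of interest
wanted <- c("Y1", "Y3")

results <- list()
for (col in wanted) {
  # col is a string such as "Y1"; df[[col]] looks the column up by name
  results[[col]] <- mean(df[[col]])
}
results
```

If the function is called only for its side effect, as plotdata() is when it saves a file, invisible(lapply(wanted, plotdata)) should do the same job as the for loop.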