I have a .csv file with data like this:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.51793 12.79 3.50 1.12 73.03 0.64 8.77 0.00 0.00 BWF
2 1.51643 12.16 3.52 1.35 72.89 0.57 8.53 0.00 0.00 VWF
3 1.51793 13.21 3.48 1.41 72.64 0.59 8.43 0.00 0.00 BWF
4 1.51299 14.40 1.74 1.54 74.55 0.00 7.59 0.00 0.00 TBL
5 1.53393 12.30 0.00 1.00 70.16 0.12 16.19 0.00 0.24 BWNF
6 1.51655 12.75 2.85 1.44 73.27 0.57 8.79 0.11 0.22 BWNF
I want to create histograms for the distribution of each of the columns.
I've tried this:
data<-read.csv("glass.csv")
names<-(attributes(data)$names)
for(name in names)
{
dev.new()
hist(data$name)
}
But I keep getting this error: Error in hist.default(data$name) : 'x' must be numeric
I'm assuming that this error is because attributes(data)$names returns a set of strings, "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type"
But I'm unable to convert them to the necessary format.
Any help is appreciated!
You were close. Note that your loop also tries to plot the Type column at the end, which is not numeric.
data<-read.csv("glass.csv")
# names<-(attributes(data)$names)
names<-names(data)
classes<-sapply(data,class)
for(name in names[classes == 'numeric'])
{
dev.new()
hist(data[,name]) # subset with [] not $
}
You could also just loop through the columns directly:
for (column in data[classes == 'numeric']) {
dev.new()
hist(column)
}
But ggplot2 is designed for multiple plots. Try it like this:
library(ggplot2)
library(reshape2)
ggplot(melt(data),aes(x=value)) + geom_histogram() + facet_wrap(~variable)
Rather than drawing lots of histograms, a better solution is to draw one plot with histograms in panels.
For this, you'll need the reshape2 and ggplot2 packages.
library(reshape2)
library(ggplot2)
First, you'll need to convert your data from wide to long form.
long_data <- melt(data, id.vars = "Type", variable.name = "Element")
Then create a ggplot of the value column (you can change its name by passing value.name = "whatever" in the call to melt above) with a histogram in each panel, split by element.
(histograms <- ggplot(long_data, aes(value)) +
geom_histogram() +
facet_wrap(~ Element)
)
hist(data$name) looks for a column named name, which isn't there. Use hist(data[,name]) instead.
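A small illustration of the difference, as a sketch on a made-up data frame (the values and the variable names by_dollar, by_bracket and by_index are hypothetical, not part of the glass data):

```r
# Hypothetical toy data frame; values are made up for illustration.
df <- data.frame(RI = c(1.51, 1.52, 1.53), Na = c(12.8, 12.2, 13.2))

name <- "RI"

# $ takes 'name' literally: it looks for a column called "name".
by_dollar <- df$name        # NULL, because no such column exists

# [[ ]] and [ , ] evaluate the variable, so they fetch the "RI" column.
by_bracket <- df[[name]]
by_index   <- df[, name]
```

Passing by_dollar to hist() reproduces the 'x' must be numeric error, because NULL is not numeric.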
I need to merge two lists with each other but I am not getting what I want and I think it is because the "Date" column is in two different formats. I have a list called li and in this list there are 12 lists each with the following format:
> tail(li$fxe)
Date fxe
3351 2020-06-22 0.0058722768
3352 2020-06-23 0.0044256216
3353 2020-06-24 -0.0044998220
3354 2020-06-25 -0.0027309539
3355 2020-06-26 0.0002832672
3356 2020-06-29 0.0007552346
I am trying to merge each of these unique lists with a different list called factors which looks like :
> tail(factors)
Date Mkt-RF SMB HML RF
3351 20200622 0.0071 0.83 -1.42 0.000
3352 20200623 0.0042 0.15 -0.56 0.000
3353 20200624 -0.0261 -0.52 -1.28 0.000
3354 20200625 0.0112 0.25 0.50 0.000
3355 20200626 -0.0243 0.16 -1.37 0.000
3356 20200629 0.0151 1.25 1.80 0.000
The reason I need this structure is that I am trying to send them to a function I wrote to do linear regressions. The first line of my function aims to merge these lists, but when I merge them I end up with an empty result even though my lists clearly have the same number of rows. In my function, df is an element of li. The nested structure of li is confusing me. Can someone help please?
Function I want to use:
Bf <- function(df, fac){
#This function calculates the betas of the Fama-French factors
#using linear regression
#Input: df = a data frame containing returns of the security
# fac = data frame containing excess market return and
# Fama-French 3 factors
#Output: a vector of betas of the Fama-French model
temp <- merge(df, fac, by="Date")
temp <- temp[, !names(temp) %in% "Date"]
temp[ ,1] <- temp[,1] - temp$RF
return(lm(temp[,1]~temp[,2]+temp[,3]+temp[,4])$coeff)
}
a: You are dealing with data frames, not lists (li is a list of data frames).
b: To merge them, you need to convert the factors$Date column so that it matches the format of li$fxe$Date.
Try:
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")
This should convert the factors$Date column to "Date" format. Note that %m is the month; %M would be parsed as minutes.
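A minimal sketch of the conversion followed by the merge, assuming li$fxe$Date is already of class Date and factors$Date holds integers such as 20200622. The toy values below are made up, and the names fxe and merged_all are only for illustration:

```r
# Toy stand-ins for li$fxe and factors; the values are made up.
fxe <- data.frame(
  Date = as.Date(c("2020-06-22", "2020-06-23", "2020-06-24")),
  fxe  = c(0.0059, 0.0044, -0.0045)
)
factors <- data.frame(
  Date = c(20200622L, 20200623L, 20200624L),
  RF   = c(0, 0, 0)
)

# Convert the integer yyyymmdd values to Date (%m = month, %M = minute).
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")

# Both Date columns now have the same class, so the keys match.
merged <- merge(fxe, factors, by = "Date")

# The same merge applied to every data frame in a list such as li.
li <- list(fxe = fxe)
merged_all <- lapply(li, function(d) merge(d, factors, by = "Date"))
```

If li$fxe$Date turns out to be character rather than Date, it would need its own as.Date() call before merging.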
I'm doing auto Binning Histogram for my second time, but it looks elementary. I'm seeking help to improve it.
what I have tried is
> DAta <- read.table(text="Species DNA LINE LTR SINE Helitron Unclassified Unmasked
+ darius 2.68 10.37 18.00 1.52 3.64 0.03 63.79
+ Derian 2.74 10.59 16.61 1.56 4.24 0.03 64.23
+ rats 2.77 10.97 15.20 1.57 4.69 0.03 64.77
+ Mouos 2.53 10.42 17.33 1.42 3.68 0.02 64.6", header=TRUE)
> library(reshape2)
> DF1 <- melt(DF, id.var="Rank")
> DF1 <- melt(DAta, id.var="Species")
> library(ggplot2)
> ggplot(DF1, aes(x = Species, y = value, fill = variable)) +
+ geom_bar(stat = "identity")
Output:
How can I make the species name in Italic?
How can I keep the bars in the same order as the input, from left to right (darius, Derian, rats and Mouos)?
How can I make the colours and style look better?
There are 3 questions here:
To change the axis labels to italics, adjust the axis.text.x theme element; see the question/answers referenced at the bottom.
To change the ordering of the axis labels, convert the variable Species to a factor and define the desired order of its levels.
Finally, to change the color scheme, use one of the scale_fill_* functions. I like the ColorBrewer palettes (available through scale_fill_brewer), which include several good color schemes. A few other predefined scale_fill_* options are also available.
Note: this is a bar chart, not a histogram.
See the comments for additional details:
DAta <- read.table(text="Species DNA LINE LTR SINE Helitron Unclassified Unmasked
darius 2.68 10.37 18.00 1.52 3.64 0.03 63.79
Derian 2.74 10.59 16.61 1.56 4.24 0.03 64.23
rats 2.77 10.97 15.20 1.57 4.69 0.03 64.77
Mouos 2.53 10.42 17.33 1.42 3.68 0.02 64.6", header=TRUE)
#updated method to reshape data. tidyr is replacement for reshape2
library(tidyr)
DF1 <- pivot_longer(DAta, cols=-1, names_to = "Classification", values_to = "Value" )
#Set Species as factors defining the order of the labels
DF1$Species<-factor(DF1$Species, levels=c("darius", "Derian", "rats", "Mouos"))
library(ggplot2)
ggplot(DF1, aes(x = Species, y = Value, fill = Classification)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Pastel1") +
theme(axis.text.x = element_text(face="italic"))
Option: If the number of rows or the species names can change, here is a potential option for maintaining the proper ordering of the Species names:
#retrieves the species names from the original data frame,
# in the order in which they appear in the Species column
DF1$Species<-factor(DF1$Species, levels= unique(DAta$Species))
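To see why the factor levels matter, here is a small sketch on a made-up character vector (the names species, default_levels and ordered_f are hypothetical). By default the levels are sorted, while unique() keeps the order of first appearance:

```r
# Species names repeated as they would be after reshaping to long form.
species <- c("darius", "Derian", "rats", "Mouos", "darius", "rats")

# Default: levels are sorted (the exact order depends on the locale).
default_levels <- levels(factor(species))

# unique() keeps first-appearance order, so the axis follows the input.
ordered_f <- factor(species, levels = unique(species))
levels(ordered_f)
```

ggplot2 draws a discrete axis in level order, so ordered_f would place the bars in input order.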
To adjust the axis labels here is a good reference:
Changing font size and direction of axes text in ggplot2
So I would like to read in data and do a summation of one of the columns of 18000 points of data. The thing is the summation requires the variable Tc and then to subtract five iterations before. I don't know how to make it start at its summation 5 data points down so it does not give me an error that there is nothing to subtract in the first 4 data points.
Here is what a small portion of the data looks like:
head(data)
Time Record Ux Uy Uz Ts Tc Tn To Tp Tq
1 2016-09-07 09:00:00.1 38651948 0.46 1.21 -0.26 19.53 19.31726 20.43197 19.39093 19.54993 NAN
2 2016-09-07 09:00:00.2 38651949 0.53 1.24 -0.24 19.48 19.30391 20.43744 19.37996 19.51704 NAN
3 2016-09-07 09:00:00.3 38651950 0.53 1.24 -0.24 19.48 19.31249 20.43269 19.3752 19.44648 NAN
4 2016-09-07 09:00:00.4 38651951 0.53 1.24 -0.24 19.48 19.30391 20.40221 19.33919 19.41596 NAN
5 2016-09-07 09:00:00.5 38651952 0.53 1.24 -0.24 19.48 19.24906 20.36079 19.31178 19.38068 NAN
6 2016-09-07 09:00:00.6 38651953 0.51 1.28 -0.28 19.44 19.20519 20.32008 19.30629 19.42693 NAN
Here is the code:
data <- read.csv(('TOA5_10815.raw_data5411_2016_09_07_0900.dat'),
header = FALSE,
dec = ",",
col.names = c("Time", "Record", "Ux", "Uy", "Uz", "Ts", "Tc", "Tn", "To", "Tp", "Tq"),
skip = 4)
Tc = data$Tc
sum = 0
m = 18000
j = 5
for (k in 1:(m-j)){
inner = (Tc[[k]]-Tc[[k-j]])
sum = sum + inner
}
final = 1/(m-j)*sum
Welcome to stackoverflow!
I would suggest you make a more reproducible example for your next questions here (see here).
To answer your question, you can either do this in a for loop, as you are currently doing, or in a more compact way using one of the apply functions (here: lapply). You can read more about these functions here.
Creating data set:
set.seed(1)
Tc<-rnorm(18000)
The lapply function. Note that we start at 6: Tc[0] is an empty vector, so Tc[5] - Tc[c(5-5)] would return numeric(0) rather than a difference.
sum<-unlist(lapply(6:18000,function(x) Tc[x]-Tc[c(x-5)]))
Done!
Verifying the function by typing in console:
> head(sum)
[1] -0.1940146 0.3037857 1.5739533 -1.0194995 -0.6348962 2.3322496
> Tc[6]-Tc[1]
[1] -0.1940146
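The same result can also be obtained without an explicit loop or lapply. The sketch below uses base R's diff() with a lag, on the same simulated Tc as above; the names j, lagged and final are chosen to mirror the question's code and are otherwise arbitrary:

```r
set.seed(1)
Tc <- rnorm(18000)
j <- 5

# diff() with lag = j returns Tc[k] - Tc[k - j] for k = j + 1, ..., length(Tc).
lagged <- diff(Tc, lag = j)

# Mean of the lagged differences, i.e. 1 / (m - j) * sum(...) with m = length(Tc).
final <- mean(lagged)
```

Because diff() only returns the pairs that exist, there is nothing to subtract for the first j points and no special handling is needed.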
I did some calculations in R and I want to produce it into excel like this
DATA1 DATA2
54.364 2.05
56.532
54.21
41.485
65.8745
54.0546
75.156
but instead is coming like this
DATA1 DATA2
54.364 2.05
56.532 2.05
54.21 2.05
41.485 2.05
65.8745 2.05
54.0546 2.05
75.156 2.05
My function to produce it in excel is
write.xlsx(c(data.frame(DATA1),data.frame(DATA2)))
Although data1 has values of 54.364, 56.532, 54.21, 41.485, 65.8745, 54.0546, 75.156 and data2 2.05
Excel has a rather bizarre "copy down" feature where it copies a function returning a scalar into every cell in the calling range. It appears that this is happening to you here.
One way to work around this is to use Application.Caller at the top of the function that's called directly. This returns a Range object denoting the calling range. You can then pad your function's return values with #N/A. You do this by inserting variant types into your array set to VT_ERROR, with the error values set to xlErrNa. You can use CVErr(xlErrNa) to do that in one step. Padding with #N/A matches what Excel does with oversized calling ranges for functions returning arrays.
Following code can also be used:
(using @akrun's data in https://stackoverflow.com/questions/25547210/how-to-produce-this-order-in-r)
DATA1 <- c(54.364, 56.532, 54.21, 41.845, 65.8745, 54.0546, 75.156)
DATA2 <- 2.05
DATA3 <- c(2.2, 2.4, 2.32)
outdf = data.frame(data1=numeric(), data2=numeric(), data3=numeric())
for(i in 1:length(DATA1)) outdf[i,]=c(DATA1[i],0,0)
for(i in 1:length(DATA2)) outdf$data2[i]=DATA2[i]
for(i in 1:length(DATA3)) outdf$data3[i]=DATA3[i]
outdf
data1 data2 data3
1 54.3640 2.05 2.20
2 56.5320 0.00 2.40
3 54.2100 0.00 2.32
4 41.8450 0.00 0.00
5 65.8745 0.00 0.00
6 54.0546 0.00 0.00
7 75.1560 0.00 0.00
Then you can use outdf with write.xlsx.
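If empty cells are preferred over zeros, one possible variation is to pad the shorter vectors with NA instead. This is only a sketch using the values from the question; cols, n and padded are hypothetical helper names:

```r
DATA1 <- c(54.364, 56.532, 54.21, 41.485, 65.8745, 54.0546, 75.156)
DATA2 <- 2.05

cols <- list(DATA1 = DATA1, DATA2 = DATA2)
n <- max(lengths(cols))

# Extending a vector with `length<-` pads it with NA up to length n.
padded <- lapply(cols, function(v) {
  length(v) <- n
  v
})

# All columns now have equal length, so nothing gets recycled.
outdf <- as.data.frame(padded)
```

Whether NA ends up as a blank cell depends on the package that provides write.xlsx, so check its documentation for the relevant option.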
I am writing an R script that incorporates a data frame.
The data frame has the following look:
mydf <- read.csv('file', header = TRUE, sep=",")
mydf
....Prod Date AVG
189 CA123 2012/07/24 14:32:35 0.2424 0.22 0.25 0.27
190 JK489 2012/08/25 18:29:08 0.2402 0.22 0.25 0.27
191 CA15K 2012/07/24 13:49:07 0.2427 0.22 0.25 0.27
192 JA45A 2012/07/22 02:32:40 0.2455 0.22 0.25 0.27
193 JA3HS 2012/07/24 22:26:25 0.2410 0.22 0.25 0.27
194 CA429 2012/08/28 10:36:16 0.2351 0.22 0.25 0.27
195 JK345 2012/07/25 07:11:24 0.2419 0.22 0.25 0.27
...
I am using this code to plot the data:
plot(Date,mydf$AVG,xlab='Date',ylab='AVG',main='title')
legend("topright", legend = c(" "," "), text.width = strwidth("1,000,000"), lty = 1:2, xjust = 1, yjust = 1, title = "Prods")
The plot is working fine, but I am unable to get the legend formatting right. What I want to do is place a legend in the top right that will display each Prod as a different color data point on the graph; however, Prod also needs to be truncated so that only the first two characters of each value are used.
I know I can access all the values with mydf$Prod, but is there a way to truncate each item in that column to just two characters? I tried using round, but I am unable to perform any math operations on it, which makes sense.
Is there a way to truncate these values and then paste them into the legend, keeping the truncated format? The legend will need to be dynamic, because the Prods are constantly changing and I run the script on different files.
One additional item: Ideally, I would like this to be done with just the standard libraries. I'm not currently using ggplot or any other graphing library, as the graphs I am creating are simple.
Try this:
mydf$Labels = substr(mydf$Prod, 1, 2)
f = factor(mydf$Labels)
l = levels(f)
plot(mydf$Date, mydf$AVG, xlab="Date", ylab="AVG", col=f) # the column is named AVG, not Avg
legend("topright", legend = l, fill = seq_along(l), title = "Prods")
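A self-contained version to experiment with, as a sketch only: the rows below are a made-up subset modeled on the question's data, and the sketch assumes the Date column is read as text in "yyyy/mm/dd hh:mm:ss" form and needs parsing before plotting.

```r
# Hypothetical rows modeled on the question's data.
mydf <- data.frame(
  Prod = c("CA123", "JK489", "CA15K", "JA45A"),
  Date = c("2012/07/24 14:32:35", "2012/08/25 18:29:08",
           "2012/07/24 13:49:07", "2012/07/22 02:32:40"),
  AVG  = c(0.2424, 0.2402, 0.2427, 0.2455),
  stringsAsFactors = FALSE
)

# Parse the timestamps so the x axis is a real time axis.
mydf$Date <- as.POSIXct(mydf$Date, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# First two characters of each product code, as a factor.
f <- factor(substr(mydf$Prod, 1, 2))
l <- levels(f)

# The integer codes of f index the color palette; the legend reuses them.
plot(mydf$Date, mydf$AVG, xlab = "Date", ylab = "AVG",
     col = as.integer(f), pch = 19)
legend("topright", legend = l, col = seq_along(l), pch = 19, title = "Prods")
```

The levels are computed from whatever is in Prod, so the legend adapts when the script runs on a different file.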