Adding new Data rows in R - r

I am trying to build a data frame so I can generate a Plot with a specific set of data, but I am having trouble getting the data into a table correctly.
So, here is what I have available from a data query:
> head(c, n=10)
EVTYPE FATALITIES INJURIES
834 TORNADO 5633 91346
856 TSTM WIND 504 6957
170 FLOOD 470 6789
130 EXCESSIVE HEAT 1903 6525
464 LIGHTNING 816 5230
275 HEAT 937 2100
427 ICE STORM 89 1975
153 FLASH FLOOD 978 1777
760 THUNDERSTORM WIND 133 1488
244 HAIL 15 1361
I then tried to generate a set of variables to build a finished data.frame like this:
a <- c(c[1,1], c[1,2], c[1,3])
b <- c(c[6,1], c[4,2] + c[6,2], c[4,3] + c[6,3])
d <- c(c[2,1], c[2,2], c[2,3])
e <- c(c[3,1], c[3,2], c[3,3])
f <- c(c[5,1], c[5,2], c[5,3])
g <- c(c[7,1], c[7,2], c[7,3])
h <- c(c[8,1], c[8,2], c[8,3])
i <- c(c[9,1], c[9,2], c[9,3])
j <- c(c[10,1], c[10,2], c[10,3])
k <- c(c[11,1], c[11,2], c[11,3])
df <- data.frame(a,b,d,e,f,g,h,i,j)
names(df) <- c("Event", "Fatalities","Injuries")
But that is failing miserably. What I am getting is a long string of all the data variables, repeated 10 times. Nice trick, but that is not what I am looking for.
I would like to get a finished data.frame with ten (10) rows of the data, like it was originally, but with my combined data in place. Is that possible?
I am using R version 3.5.3, and the tidyverse library is not available for install on that version.
Any ideas as to how I can generate that data.frame?

If a barplot is what you're after, here's a piece of code to get you that:
First, you need to get the data in the right format (that's probably what you tried to do in df), by column-binding the two numerical variables using cbind and transposing the resulting matrix using t (i.e., turning rows into columns and vice versa):
plotdata <- t(cbind(c$FATALITIES, c$INJURIES))
Then set the layout to your plot, with a wide margin for the x-axis to accommodate your long factor names:
par(mfrow=c(1,1), mar = c(8,3,3,3))
Now you're ready to plot the data; you grab the labels from c$EVTYPE, reduce the label size in cex.names and rotate them with las to avoid overplotting:
barplot(plotdata, beside=TRUE, names.arg = c$EVTYPE, col=c("red","blue"), cex.names = 0.7, las = 3)
(You can add main = to define the heading of your plot.)
That's the barplot you should obtain:
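As for the data frame itself: data.frame(a, b, d, ...) treats every vector as a column, and c() coerces mixed text and numbers to character, which may be why the result looked nothing like the original table. Below is a minimal base R sketch of one way to merge the two heat rows. It assumes only the ten sample rows shown in the question; the names storm and combined are made up here (storm stands in for the question's c, to avoid confusion with base c()).

```r
# Sample rows copied from the question's head() output
storm <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING",
             "HEAT", "ICE STORM", "FLASH FLOOD", "THUNDERSTORM WIND", "HAIL"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937, 89, 978, 133, 15),
  INJURIES = c(91346, 6957, 6789, 6525, 5230, 2100, 1975, 1777, 1488, 1361),
  stringsAsFactors = FALSE
)

# Relabel one category so both heat rows share a name
storm$EVTYPE[storm$EVTYPE == "EXCESSIVE HEAT"] <- "HEAT"

# Sum the numeric columns within each event type
combined <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                      data = storm, FUN = sum)

# Sort by injuries, largest first, and apply the desired column names
combined <- combined[order(-combined$INJURIES), ]
names(combined) <- c("Event", "Fatalities", "Injuries")
combined
```

With the ten sample rows this gives nine rows, with HEAT at 2840 fatalities and 8625 injuries. Running the same steps on the full query result would let the next event type fill the tenth row.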

Related

Mean Y for individual X values

I have a data set in .dta format with height and weight of baseball players. I want to calculate the mean height for each individual weight value.
From what I've been able to find, I could use dplyr and "group_by", but my R script does not recognize the command, despite having installed and called the package.
Thanks!
Here is an example coded in base R using baseball player height and weight data obtained from the UCLA SOCR MLB HeightsWeights data set.
After cleaning the data (weight is missing for one player), I posted it to GitHub to make it accessible without having to clean it again.
theCSVFile <- "https://raw.githubusercontent.com/lgreski/datasciencedepot/gh-pages/data/baseballPlayers.csv"
download.file(theCSVFile,"./data/baseballPlayers.csv",method="curl")
theData <- read.csv("./data/baseballPlayers.csv",header=TRUE,stringsAsFactors=FALSE)
aggData <- aggregate(HeightInInches ~ WeightInPounds, data=theData, FUN=mean)
head(aggData)
...and the output is:
> head(aggData)
WeightInPounds HeightInInches
1 150 70.75000
2 155 69.33333
3 156 75.00000
4 160 71.46667
5 163 70.00000
6 164 73.00000
>
regards,
Len
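For comparison, tapply gives the same per-group means without a formula. This is only a sketch on made-up toy values (the players data frame is hypothetical; the column names follow the answer above). As for the dplyr route, group_by is only found after library(dplyr) has been run in the current session, which may be what went wrong.

```r
# Hypothetical toy data; not taken from the real data set
players <- data.frame(
  WeightInPounds = c(150, 150, 155, 155, 155, 160),
  HeightInInches = c(70, 72, 68, 69, 71, 73)
)

# Mean height within each distinct weight value;
# the result is a named vector keyed by weight
meanHeight <- tapply(players$HeightInInches, players$WeightInPounds, mean)
meanHeight
```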

How to create a heat map in R?

I am doing a multi-part project. To begin with, I had a data set that provided the deposits per district over the years. After scrubbing the data set, I was able to create a data frame that provides the growth of deposits by district. I have the growth of deposits for 3 different kinds of institutions (foreign banks, public banks and private banks) in 3 different data frames, as the number of rows differs in each frame. I have been asked to create 3 maps (heat maps) showing deposit growth for each kind of bank.
My data frame looks like the attached picture.
I want to make a heat map for the growth column.
Thanks.
Maybe I'm adding some spam with this answer, so delete it without hesitation.
I'll show you how I make some heatmaps in R:
Fake data:
Gene Patient_A Patient_B Patient_C Patient_D
BRCA1 52 46 124 148
TP53 512 487 112 121
FOX3D 841 658 321 364
MAPK1 895 541 198 254
RASA1 785 554 125 69
ADAM18 12 65 85 121
library(gplots)  # provides redgreen() and heatmap.2()
# hm_mx is the numeric matrix built by the read.table code further down
hmcols <- rev(redgreen(2750))
heatmap.2(hm_mx, scale="row", key=TRUE, lhei=c(2,5), symkey=FALSE, density.info="none", trace="none", cexRow=1.1, cexCol=1.1, col=hmcols, dendrogram = "none")
If you use read.table, you will probably have to convert the data frame to a matrix and set the first column as the row names to avoid errors from R:
hm <- read.table("hm1.txt", sep = '\t', header=TRUE, stringsAsFactors=FALSE)
row.names(hm) <- hm$Gene
hm_mx <- data.matrix(hm)
hm_mx <- hm_mx[,-c(1)]
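If gplots is not available, base R's heatmap() can draw a similar picture. The sketch below swaps in that function (it is not the heatmap.2 call above) and types the fake data in directly, so it does not depend on hm1.txt. The color palette and margins are arbitrary choices.

```r
# Fake expression values copied from the table above
hm_mx <- matrix(
  c( 52,  46, 124, 148,
    512, 487, 112, 121,
    841, 658, 321, 364,
    895, 541, 198, 254,
    785, 554, 125,  69,
     12,  65,  85, 121),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("BRCA1", "TP53", "FOX3D", "MAPK1", "RASA1", "ADAM18"),
    c("Patient_A", "Patient_B", "Patient_C", "Patient_D"))
)

# Rowv = NA and Colv = NA switch off clustering and dendrograms,
# so rows and columns are not reordered by similarity
heatmap(hm_mx, scale = "row", Rowv = NA, Colv = NA,
        col = heat.colors(50), margins = c(8, 6))
```

For the deposit-growth question, the same call would need a numeric matrix with districts as row names and one column per period or bank type.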

Dump data to Data Frame and then plot

I've been working on a sleep analysis project for a while, and now that I have some data gathered I'd like to do something with it. First of all, I have recorded my movement during sleep for a while, and it is now in a .csv file like so:
0:58 1:08 1:18 1:28 1:38 1:48 1:58
3096 4062 903 113 1331 76 521
0:30 0:40 0:50 1:00 1:10 1:20 1:30
4081 1661 1198 70 841 1052 76
0:47 0:57 1:07 1:17 1:27 1:37 1:47
2327 1823 1354 1547 64 75 84
The first row is the time in 10-minute intervals and the second one is the quantity of movement. Each pair of lines is a night of sleep, and the data continues until the wake-up time.
Now I have to import the data into R and then work with it. I've imported the data using the read.csv() function, but now I'm stuck. I guess I'll have to use a data frame to store the data, because I have two types of data: one is a time and the other is an integer. I've worked with arrays and matrices, and I cannot really understand how a data frame would fit into this program. Even if I get to understand data frames, I don't know how to work with arrays/data frames of different sizes, because each night has a different length depending on how much I've slept. I'd like to plot a timeline of the average night's sleep time against the average movement.
I would like to know if my assumption about using data frames is correct, and how I would work with arrays of different lengths to compute the mean across all of them.
Thank you in advance!
EDIT
Using @Pierre Lafortune's code:
library(ggplot2)
df <-read.csv('/Users/jdmg718/Dropbox/GitHub/SleepAnalysisWithR/Movement.csv', stringsAsFactors=FALSE)
s <- split(df, rep(1:2, nrow(df)/2))
newdf <- as.data.frame(sapply(s, function(u) unlist(t(u))), stringsAsFactors=FALSE)
names(newdf) <- c('Time', 'Movements')
newdf[,2] <- as.numeric(newdf[,2])
ggplot(newdf, aes(x=Time, y=Movements, group=1)) + geom_line()
I am getting the following warnings:
Warning messages:
1: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
2: In eval(expr, envir, enclos) : NAs introduced by coercion
Try splitting the data by type. Then you can create the charts that you need:
df <- read.csv('sleep.csv', stringsAsFactors=FALSE)
s <- split(df, rep(1:2, nrow(df)/2))
newdf <- as.data.frame(sapply(s, function(u) unlist(t(u))), stringsAsFactors=FALSE)
names(newdf) <- c('Time', 'Movements')
newdf[,2] <- as.numeric(newdf[,2])
Line Graph
library(ggplot2)
ggplot(newdf, aes(x=Time, y=Movements, group=1)) + geom_line()
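For the averaging part of the question, one option is to stack all nights into a single long data frame and aggregate by interval, which works even when nights differ in length. The sketch below is an assumption-laden illustration: it uses only the three sample nights from the question, aligns nights by interval number since sleep onset rather than by clock time, and the names nights, long and avg are made up.

```r
# Movement counts per 10-minute interval, copied from the question's sample
nights <- list(
  c(3096, 4062, 903, 113, 1331, 76, 521),
  c(4081, 1661, 1198, 70, 841, 1052, 76),
  c(2327, 1823, 1354, 1547, 64, 75, 84)
)

# Stack into long format: one row per (night, interval) pair
long <- do.call(rbind, lapply(seq_along(nights), function(i) {
  data.frame(night = i,
             step = seq_along(nights[[i]]),  # interval number within the night
             movement = nights[[i]])
}))

# Mean movement per interval, using whichever nights reach that interval
avg <- aggregate(movement ~ step, data = long, FUN = mean)

plot(avg$step, avg$movement, type = "l",
     xlab = "10-minute interval since sleep onset", ylab = "Mean movement")
```

Because each night contributes only the intervals it actually has, longer nights simply add rows for later steps and no padding is needed.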

R - plotting specific columns as x with two rows as the lines

Here is a small sample of my data:
gene_name ctrl_lsm1_ratio_t0 ctrl_lsm1_ratio_t1 ctrl_lsm1_ratio_t2
22 ABP140 -0.262682 -0.303352 -0.223626
246 ARI1 -0.163952 -0.374765 -0.321876
454 BPH1 -0.517519 -0.524553 -0.747609
513 BUR6 0.645573 0.217433 0.390403
588 CDC20 -0.264072 -0.665268 -0.594191
ctrl_lsm1_ratio_t3 ctrl_lsm1_stat_t0 ctrl_lsm1_stat_t1 ctrl_lsm1_stat_t2
22 -0.421704 no no no
246 -0.692391 no no no
454 -0.793595 no no yes
513 0.200799 yes no no
588 -0.523884 no yes yes
ctrl_lsm1_stat_t3 systematic_name
22 yes YOR239W
246 yes YGL157W
454 yes YCR032W
513 no YER159C
588 yes YGL116W
I would like to plot columns [,2:5] on the x axis (as in time point 0, 1, 2, and 3) with the y axis fitting the ratio columns.
If there's a way to color the points to be one color for "yes" or "no" at the specific time points, I would also like to be able to do that. (for instance, points in the ctrl_lsm1_ratio_t0 column would be colored based on values in the ctrl_lsm1_stat_t0 column).
I also only want to plot two rows at a time, both as lines (for instance row 22 with row 513). Hope this makes sense! I'm new to R and not sure what to do. I'm willing to download whatever package necessary.
data.csv:
gene_name,ctrl_lsm1_ratio_t0,ctrl_lsm1_ratio_t1,ctrl_lsm1_ratio_t2,ctrl_lsm1_ratio_t3,ctrl_lsm1_stat_t0,ctrl_lsm1_stat_t1,ctrl_lsm1_stat_t2,ctrl_lsm1_stat_t3,systematic_name
ABP140,-0.262682,-0.303352,-0.22362,-0.421704,no,no,no,yes,YOR239W6
ARI1,-0.163952,-0.374765,-0.32187,-0.692391,no,no,no,yes,YGL157W6
BPH1,-0.517519,-0.524553,-0.74760,-0.793595,no,no,yes,yes,YCR032W9
BUR6,0.645573,0.217433,0.39040,0.200799,yes,no,no,no,YER159C3
CDC20,-0.264072,-0.665268,-0.59419,-0.523884,no,yes,yes,yes,YGL116W1
Code:
d<-read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# one line per gene: rows of the transposed matrix are time points
matplot(t(d[,2:5]), type="l", pch=20, lty=1, xlab="time", ylab="ctrl_lsm1_ratio")
# long format of the yes/no columns, in the same order as unlist(d[,2:5])
d2<-reshape(d[,6:9],varying=list(names(d[,6:9])),direction="long",v.names="ctrl_lsm1_stat", ids=d$gene_name)
points(d2$time, unlist(d[,2:5]), col=ifelse(d2$ctrl_lsm1_stat=="yes",1,2),cex=2.0)
legend("topright",legend=c("yes","no"), col=c(1,2), pch=21)
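To plot only two rows at a time, as the question asks, the data can be subset by gene name before plotting. This is a sketch under assumptions: it embeds the data.csv contents shown above as text, picks ABP140 and BUR6 as the example pair, and the names two, ratio and cols (and the red/blue choice) are made up.

```r
csv <- "gene_name,ctrl_lsm1_ratio_t0,ctrl_lsm1_ratio_t1,ctrl_lsm1_ratio_t2,ctrl_lsm1_ratio_t3,ctrl_lsm1_stat_t0,ctrl_lsm1_stat_t1,ctrl_lsm1_stat_t2,ctrl_lsm1_stat_t3,systematic_name
ABP140,-0.262682,-0.303352,-0.22362,-0.421704,no,no,no,yes,YOR239W6
ARI1,-0.163952,-0.374765,-0.32187,-0.692391,no,no,no,yes,YGL157W6
BPH1,-0.517519,-0.524553,-0.74760,-0.793595,no,no,yes,yes,YCR032W9
BUR6,0.645573,0.217433,0.39040,0.200799,yes,no,no,no,YER159C3
CDC20,-0.264072,-0.665268,-0.59419,-0.523884,no,yes,yes,yes,YGL116W1"
d <- read.csv(text = csv, stringsAsFactors = FALSE)

# Keep only the two genes of interest
two <- d[d$gene_name %in% c("ABP140", "BUR6"), ]

# Rows = time points 0..3, columns = the two genes
ratio <- t(as.matrix(two[, 2:5]))
cols <- ifelse(t(as.matrix(two[, 6:9])) == "yes", "red", "blue")

# One line per gene, then points colored by the matching stat column
matplot(0:3, ratio, type = "l", lty = 1, col = 1:2,
        xlab = "time point", ylab = "ctrl_lsm1_ratio")
points(rep(0:3, times = nrow(two)), as.vector(ratio),
       col = as.vector(cols), pch = 19)
legend("topright", legend = c(two$gene_name, "yes", "no"),
       col = c(1, 2, "red", "blue"), lty = c(1, 1, NA, NA),
       pch = c(NA, NA, 19, 19))
```

Changing the two names in the %in% vector selects a different pair of genes.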

R: Plots of subset still include excluded attributes, how do I draw a plot without them?

I am trying to draw a boxplot in R:
I have a dataset with 70 attributes:
The format is
patient number medical_speciality number_of_procedures
111 Ortho 21
232 Emergency 16
878 Pediatrics 20
981 OBGYN 31
232 Care of Elderly 15
211 Ortho 32
238 Care of Elderly 11
219 Care of Elderly 6
189 Emergency 67
323 Emergency 23
189 Pediatrics 1
289 Ortho 34
I have been trying to get a subset to only include emergency, pediatrics in a boxplot (there are 10000+ datapoints in reality)
I thought that I could just do this:
newdata<-subset(olddata[ms$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
plot(newdata)
Since if I do a summary of newdata, all it has is the pediatrics and emergency results. But when it comes to plotting it still includes the ortho, OBGYN, care of elderly in the x axis with no boxplot.
I presume that there is a way to do this in ggplot by doing
ggplot(newdata, aes(x=medical_speciality, y=num_of_procedures, fill=cond)) + geom_boxplot()
but this gives me the error:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the data
Problems:cond
Can someone help me out?
I believe your problem comes from the fact that the column medical_speciality is a factor.
So, even though you subset your data the right way, you still get all the levels (including "Ortho", "OBGYN", etc...).
You can get rid of them by using the function droplevels:
newdata<-subset(olddata[olddata$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
newdata <- droplevels(newdata) ## THIS IS THE NEW ADDITION
plot(newdata)
Does this help?
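A small self-contained sketch of the effect, built from the sample rows in the question. The column names and capitalization follow the question's table, which is an assumption, since the real data set may spell them differently.

```r
# Sample rows from the question, with the specialty stored as a factor
olddata <- data.frame(
  medical_speciality = factor(c("Ortho", "Emergency", "Pediatrics", "OBGYN",
                                "Care of Elderly", "Ortho", "Care of Elderly",
                                "Care of Elderly", "Emergency", "Emergency",
                                "Pediatrics", "Ortho")),
  number_of_procedures = c(21, 16, 20, 31, 15, 32, 11, 6, 67, 23, 1, 34)
)

# Subsetting removes the rows but keeps every original factor level
newdata <- olddata[olddata$medical_speciality %in% c("Emergency", "Pediatrics"), ]
levels_before <- nlevels(newdata$medical_speciality)

# droplevels() discards the levels that no longer occur in the data
newdata <- droplevels(newdata)
levels(newdata$medical_speciality)

boxplot(number_of_procedures ~ medical_speciality, data = newdata)
```

Before droplevels() the factor still reports five levels, which is why the empty categories appear on the x axis. Afterwards only Emergency and Pediatrics remain.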
