I've been working on a sleep analysis project for a while, and now that I have some data gathered I'd like to do something with it. First of all, I've been recording my movement during sleep, and the data is now in a .csv file like so:
0:58 1:08 1:18 1:28 1:38 1:48 1:58
3096 4062 903 113 1331 76 521
0:30 0:40 0:50 1:00 1:10 1:20 1:30
4081 1661 1198 70 841 1052 76
0:47 0:57 1:07 1:17 1:27 1:37 1:47
2327 1823 1354 1547 64 75 84
The first row of each pair is the time in 10-minute intervals and the second one is the amount of movement. Each pair of lines is one night of sleep, and the data continues until the wake-up time.
Now I have to import the data into R and work with it. I've imported the data using the read.csv() function, but now I'm stuck. I guess I'll have to use a data frame to store the data, because I have two types of data: one is a time and the other is an integer. I've worked with arrays and matrices, but I can't really see how a data frame fits into this program. Even if I come to understand data frames, I don't know how to work with arrays/data frames of different sizes, since each night has a different length depending on how much I've slept. I'd like to plot a timeline of the average night of sleep with the average movement.
I would like to know if my assumption of using data frames is correct, and how I would work with arrays of different lengths to compute the mean across all of them.
Thank you in advance!
EDIT
Using @Pierre Lafortune's code:
library(ggplot2)
df <-read.csv('/Users/jdmg718/Dropbox/GitHub/SleepAnalysisWithR/Movement.csv', stringsAsFactors=FALSE)
s <- split(df, rep(1:2, nrow(df)/2))
newdf <- as.data.frame(sapply(s, function(u) unlist(t(u))), stringsAsFactors=FALSE)
names(newdf) <- c('Time', 'Movements')
newdf[,2] <- as.numeric(newdf[,2])
ggplot(newdf, aes(x=Time, y=Movements, group=1)) + geom_line()
I am getting the following warnings:
Warning messages:
1: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
2: In eval(expr, envir, enclos) : NAs introduced by coercion
Try splitting the data by type. Then you can create the charts that you need:
df <- read.csv('sleep.csv', stringsAsFactors=FALSE)
s <- split(df, rep(1:2, nrow(df)/2))
newdf <- as.data.frame(sapply(s, function(u) unlist(t(u))), stringsAsFactors=FALSE)
names(newdf) <- c('Time', 'Movements')
newdf[,2] <- as.numeric(newdf[,2])
Line Graph
library(ggplot2)
ggplot(newdf, aes(x=Time, y=Movements, group=1)) + geom_line()
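The first warning in the edit suggests the file has an odd number of rows, so rep(1:2, nrow(df)/2) comes out shorter than the data; rep(1:2, length.out = nrow(df)) avoids that. For the original goal of averaging nights of different lengths, one option is to pad each night to the longest length with NA and then average position-by-position. A minimal sketch; the movement vectors below are illustrative, loosely based on the sample data, not the asker's actual file:

```r
# Each night is a numeric vector of movement counts; lengths differ.
nights <- list(
  c(3096, 4062, 903, 113, 1331, 76, 521),
  c(4081, 1661, 1198, 70, 841, 1052),
  c(2327, 1823, 1354, 1547, 64)
)

# Pad every night with NA up to the longest night, column-bind them,
# then average across nights at each 10-minute position.
max_len <- max(lengths(nights))
padded  <- sapply(nights, function(v) c(v, rep(NA, max_len - length(v))))
avg     <- rowMeans(padded, na.rm = TRUE)
avg
```

na.rm = TRUE means later positions are averaged over only the nights that lasted that long, which matches "each night has a different length".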
Related
I would like to visualize the number of people infected with COVID-19, but I am unable to obtain the mortality rate per 100,000 population for each prefecture because the number of deaths is not being treated as numeric.
What I want to achieve
I want to evaluate "covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100" by setting the data type of "covid19j_20200613$deaths" to num.
Error message.
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019:
Argument of binary operator is not numeric
Source code in question.
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)
covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
header=TRUE)
# Below is an example for May 20, 2020.
# Month and date may be changed
covid19j_20200613 <- dplyr::filter(covid19j,
                                   year == 2020,
                                   month == 6,
                                   date == 13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0
pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header=TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
by = c("CODE" = "CODE"))
# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")
# Data and concatenation
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by=c("PREF_CODE"="CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
  geom_sf(aes(fill = testedPositive)) +
  scale_fill_distiller(palette = "RdYlGn") +
  theme_bw() +
  labs(title = "Tested Positive of Covid19 (2020/06/13)")
# Mortality rate per 100,000 population
# Population number in units of 1000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Data in question.
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What I tried
I tried putting "as.numeric(covid19j_20200613$deaths)" before the calculation to set the deaths column to type num, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
as.numeric() does not permanently change the data type; it only converts temporarily.
So when you're running as.numeric(covid19j_20200613$deaths), this shows you the column deaths as numeric, but the column will stay a character.
So if you want to coerce the data type, you need to also reassign:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
  mutate(
    deaths = as.numeric(deaths),
    POP2019 = as.numeric(POP2019),
    death_rate = deaths / POP2019 * 100
  )
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.
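A minimal sketch of that coercion on a toy column; the sample values below are assumptions, not the real CSV. Blank strings become NA under as.numeric(), which you can then recode to 0 just as the question's code does for the full frame:

```r
# Toy character column with blanks, as str(covid19j) would suggest.
deaths <- c("91", "", "1", "", "0")

deaths_num <- as.numeric(deaths)    # blanks coerce to NA (with a warning)
deaths_num[is.na(deaths_num)] <- 0  # treat missing as zero deaths
deaths_num
```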
I am trying to build a data frame so I can generate a Plot with a specific set of data, but I am having trouble getting the data into a table correctly.
So, here is what I have available from a data query:
> head(c, n=10)
EVTYPE FATALITIES INJURIES
834 TORNADO 5633 91346
856 TSTM WIND 504 6957
170 FLOOD 470 6789
130 EXCESSIVE HEAT 1903 6525
464 LIGHTNING 816 5230
275 HEAT 937 2100
427 ICE STORM 89 1975
153 FLASH FLOOD 978 1777
760 THUNDERSTORM WIND 133 1488
244 HAIL 15 1361
I then tried to generate a set of data variables to build a finished data.frame like this:
a <- c(c[1,1], c[1,2], c[1,3])
b <- c(c[6,1], c[4,2] + c[6,2], c[4,3] + c[6,3])
d <- c(c[2,1], c[2,2], c[2,3])
e <- c(c[3,1], c[3,2], c[3,3])
f <- c(c[5,1], c[5,2], c[5,3])
g <- c(c[7,1], c[7,2], c[7,3])
h <- c(c[8,1], c[8,2], c[8,3])
i <- c(c[9,1], c[9,2], c[9,3])
j <- c(c[10,1], c[10,2], c[10,3])
k <- c(c[11,1], c[11,2], c[11,3])
df <- data.frame(a,b,d,e,f,g,h,i,j)
names(df) <- c("Event", "Fatalities","Injuries")
But that is failing miserably. What I am getting is a long string of all the data variables, repeated 10 times. Nice trick, but that is not what I am looking for.
I would like to get a finished data.frame with ten (10) rows of the data, like it was originally, but with my combined data in place. Is that possible?
I am using R version 3.5.3, and the tidyverse library is not available for installation on that version.
Any ideas as to how I can generate that data.frame?
If a barplot is what you're after, here's a piece of code to get you that:
First, you need to get the data in the right format (that's probably what you tried to do in df), by column-binding the two numerical variables using cbind and transposing the resulting dataframe using t (i.e., turning rows into columns and vice versa):
plotdata <- t(cbind(c$FATALITIES, c$INJURIES))
Then set the layout to your plot, with a wide margin for the x-axis to accommodate your long factor names:
par(mfrow=c(1,1), mar = c(8,3,3,3))
Now you're ready to plot the data; you grab the labels from c$EVTYPE, reduce the label size in cex.names and rotate them with las to avoid overplotting:
barplot(plotdata, beside=T, names = c$EVTYPE, col=c("red","blue"), cex.names = 0.7, las = 3)
(You can add main = to set the heading of your plot.)
That's the barplot you should obtain:
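For the data.frame-building part: the a..k approach fails because data.frame(a, b, ...) makes each vector a column, and c() on a mix of a factor and numbers coerces everything to one type. Working row-wise with rbind avoids both problems. A sketch; the toy frame below stands in for the asker's c, with only a few of the rows:

```r
# Stand-in for the query result `c` from the question.
dat <- data.frame(
  EVTYPE     = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING", "HEAT"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937),
  INJURIES   = c(91346, 6957, 6789, 6525, 5230, 2100),
  stringsAsFactors = FALSE
)

# Merge the two heat rows (4 and 6) into one combined row,
# keeping every other row as-is.
heat <- data.frame(EVTYPE     = "HEAT",
                   FATALITIES = dat$FATALITIES[4] + dat$FATALITIES[6],
                   INJURIES   = dat$INJURIES[4]   + dat$INJURIES[6])
df   <- rbind(dat[-c(4, 6), ], heat)
names(df) <- c("Event", "Fatalities", "Injuries")
df
```

This keeps the frame in rows-of-observations shape throughout, so no transposing is needed afterwards.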
I'm relatively new to R so please bear with me. I'm trying to get to grips with basic irregular time-series analysis.
This is what my data file looks like (some 40k lines). The spacing is not always exactly 20 sec.
Time, Avg
04/03/2015 00:00:23,20.24
04/03/2015 00:00:43,20.38
04/03/2015 00:01:03,20.53
04/03/2015 00:01:23,20.54
04/03/2015 00:01:43,20.53
library(zoo)
data <- read.zoo("data.csv", sep=",", tz='', header=T, format='%d/%m/%Y %H:%M:%S')
I'm happy to aggregate by minutes
library(xts)
data <- to.minutes(as.xts(data))
Using the "open" column as an example
head(data[,1])
as.xts(data).Open
2015-03-04 00:00:43 20.24
2015-03-04 00:01:43 20.53
2015-03-04 00:02:43 20.47
2015-03-04 00:03:43 20.38
2015-03-04 00:04:43 20.05
2015-03-04 00:05:43 19.84
data <- data[,1]
And here is where it all falls apart for me
fit <- stl(data, t.window=15, s.window="periodic", robust=TRUE)
Error in stl(data, t.window = 15, s.window = "periodic", robust = TRUE) :
series is not periodic or has less than two periods
I've googled the error message, but it's not really clear to me. Is period = frequency? For my dataset I would expect the seasonal component to be weekly.
frequency(data) <- 52
fit <- stl(data, t.window=15, s.window="periodic", robust=TRUE)
Error in na.fail.default(as.ts(x)) : missing values in object
?
head(as.ts(data))
[1] 20.24 NA NA NA NA NA
Uh, what?
What am I doing wrong? How do I have to prepare the xts object to be able to properly pass it to stl()?
Thank you.
One option is to extract the numeric values of xts_object and build a ts object for the stl function. However, the time stamps of xts_object are completely ignored in this case.
stl(ts(as.numeric(xts_object), frequency=52), s.window="periodic", robust=TRUE)
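That is what the "series is not periodic or has less than two periods" error is about: stl() needs a regular ts with frequency >= 2 and at least two full cycles of data. A self-contained sketch with synthetic data; the frequency of 52 and the series itself are illustrations, not derived from the asker's file:

```r
# Synthetic regular series: 4 full cycles of 52 observations each,
# so stl() sees well over the required two periods.
set.seed(1)
x <- ts(20 + sin(2 * pi * (1:208) / 52) + rnorm(208, sd = 0.1),
        frequency = 52)

fit <- stl(x, s.window = "periodic", robust = TRUE)
head(fit$time.series)  # seasonal, trend, remainder components
```

With only 40k lines at roughly 20-second spacing (about 9 days), a weekly seasonal component gives barely more than one period, which is why stl() refuses the raw data.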
I would like to plot latitude vs longitude and connect the points via date and time, which I have stored in an object of class POSIXlt. I have many, many GPS points, but here is a small set of them that I would like to plot using ggplot2.
My data are like so:
Description lat lon
6/16/2012 17:22 12.117017 -89.69692
6/17/2012 9:15 12.1178 -89.69675
6/17/2012 9:33 12.117783 -89.69673
6/17/2012 10:19 12.11785 -89.69665
6/17/2012 10:45 12.11775 -89.69677
6/17/2012 11:22 12.1178 -89.69673
6/17/2012 11:39 12.117817 -89.69662
6/17/2012 11:59 12.117717 -89.69677
6/17/2012 12:10 12.117717 -89.69655
6/16/2012 16:38 12.11795 -89.6965
6/16/2012 18:29 12.1178 -89.69688
6/16/2012 17:11 12.117417 -89.69703
6/16/2012 17:36 12.116967 -89.69668
6/16/2012 17:50 12.117217 -89.69695
6/16/2012 18:02 12.117583 -89.69715
6/16/2012 18:15 12.11785 -89.69665
6/16/2012 18:27 12.117683 -89.69632
I have a map that I am plotting these points onto.
I can plot the points just fine
plot1 <- map + geom_point(data=dat, aes(x = lon, y = lat))
map is an object I made with ggmap, but it's not that important to include here.
The following code produces a line connecting points as lon increases
plot1+geom_line(data=dat, aes(x=lon,y=lat,colour="red"))
I can't figure out how to connect the points in the order given by the POSIXlt column Description.
I know that in this small example I could easily reorder the points using something like dat2 <- dat[with(dat, order(Description)), ], and remake plot1 using dat2 and make the desired plot using the following code:
plot1+geom_path(data=dat2, aes(x = lon, y = lat, colour="red"))
But for my much larger (hundreds of thousands of observations) dataset, this doesn't make sense as a solution without a bit more work to properly id each observation, which I will certainly end up doing anyway as part of additional data exploration.
Is there an argument I haven't discovered in geom_line for telling R how to connect the points?
I am admittedly still a novice at using ggplot2, and so, I apologize if I have missed something very simple. I have been working on a lot of other code and learning, or at least using, several other packages, to work with this GPS data other spatial data available. It's all a bit overwhelming... So many ideas, so little know-how! The larger point of this is to visualize (and eventually analyze) movement patterns and use of space by my study organisms, but for now, it would be great to visualize the data in a variety of ways to really get familiar with it.
If you have any recommended packages for working with spatial data and GPS data, I'd love to hear about them, as well.
You need the rows ordered by the date/time object to use geom_path. Since I think this is the best way to display the data we should focus on finding an efficient way to sort a large dataset. Obviously it would be good to get an idea of the scale of dataset you are working with. Millions of rows? Billions perhaps?!
Fortunately the data.table package does this very well indeed. Here is an example on a 1 million row table, with an ID column X, which the table is originally sorted on, an unsorted time column of 1 second observations, and two random columns for x and y. It takes < 1s on my laptop to sort according to date/time:
set.seed(123)
require(data.table)
# Rows ordered on X, random order of unique date/time values of 1 second observations
df <- data.frame(ID   = seq.int(1e6),
                 Desc = as.POSIXct(sample(1e6), origin = Sys.Date()),
                 x    = runif(1e6),
                 y    = runif(1e6))
head(df)
# ID Desc x y
#1 1 2013-05-25 02:39:39 0.2363783 0.1387404
#2 2 2013-05-25 23:58:17 0.1192702 0.1284918
#3 3 2013-05-21 17:41:57 0.8599183 0.6301114
#4 4 2013-05-23 16:12:42 0.8089243 0.7919304
#5 5 2013-05-21 08:17:28 0.8197109 0.4568693
#6 6 2013-05-22 17:57:23 0.4611204 0.5358536
# Convert to data.table
DT <- data.table(df)
# Sort on 'Desc'
setkey(DT , Desc)
head(DT)
# ID Desc x y
#1: 544945 2013-05-18 01:00:01 0.7052422 0.52030877
#2: 886165 2013-05-18 01:00:02 0.2256636 0.04391553
#3: 893690 2013-05-18 01:00:03 0.1860687 0.30978506
#4: 932276 2013-05-18 01:00:04 0.6305562 0.65188810
#5: 407622 2013-05-18 01:00:05 0.5355992 0.98146120
#6: 138936 2013-05-18 01:00:06 0.5999025 0.81722902
# Make a data.frame from this to use with ggplot2 (you may be able to use the data.table directly)
df2 <- DT
So in your case you can try something like:
datDT <- data.table(dat)
setkey(datDT , Description)
dat2 <- datDT
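With the rows sorted, geom_path does the rest: unlike geom_line, which connects points in order of the x value, geom_path connects them in row order. A self-contained sketch with made-up GPS points standing in for the asker's data:

```r
library(ggplot2)

# Toy track: points in scrambled row order, with a timestamp to sort on.
dat <- data.frame(
  Description = as.POSIXct("2012-06-16 17:00", tz = "UTC") + c(300, 0, 600, 900, 150),
  lon = c(-89.6970, -89.6965, -89.6968, -89.6971, -89.6969),
  lat = c(12.1174, 12.1179, 12.1176, 12.1178, 12.1177)
)

# Sort by time, then connect points in that (row) order.
dat2 <- dat[order(dat$Description), ]
p <- ggplot(dat2, aes(x = lon, y = lat)) +
  geom_point() +
  geom_path(colour = "red")
```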
I need some help with data analysis.
I have two datasets (before & after) and I want to see how big the difference is between them.
Before
11330 STAT1
2721 STAT2
52438 STAT3
6124 SUZY
After
17401 STAT1
3462 STAT2
0 STAT3
72 SUZY
I tried to group them with tapply(before$V1, before$V2, FUN=mean).
But when I try to plot it, on the x axis I am not getting the group names but numbers instead.
How can I plot such tapplied data (frequency on Y axis & group name on X axis)?
I also wanted to ask: what is the proper way in R to compare such datasets, since I want to find the difference between them?
Edited
dput(before$V1)
c(11330L, 2721L, 52438L, 6124L)
dput(before$V2)
structure(1:4, .Label = c("STAT1", "STAT2", "STAT3","SUZY"),class = "factor")
Here are a couple of ideas.
This is what I think your data look like?
before <- data.frame(val=c(11330,2721,52438,6124),
lab=c("STAT1","STAT2","STAT3","SUZY"))
after <- data.frame(val=c(17401,3462,0,72),
lab=c("STAT1","STAT2","STAT3","SUZY"))
Combine them into a single data frame with a period variable:
combined <- rbind(data.frame(before,period="before"),
data.frame(after,period="after"))
Reformat to a matrix and plot with (base R) dotchart:
library(reshape2)
m <- acast(combined,lab~period,value.var="val")
dotchart(m)
Plot with ggplot:
library(ggplot2)
qplot(lab,val,colour=period,data=combined)
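For the "difference between them" part of the question, merging the two frames on the group label and subtracting is one option. A sketch using the same toy before/after frames as above:

```r
before <- data.frame(val = c(11330, 2721, 52438, 6124),
                     lab = c("STAT1", "STAT2", "STAT3", "SUZY"))
after  <- data.frame(val = c(17401, 3462, 0, 72),
                     lab = c("STAT1", "STAT2", "STAT3", "SUZY"))

# Merge on the group label, then take the per-group difference.
diffs <- merge(before, after, by = "lab", suffixes = c(".before", ".after"))
diffs$delta <- diffs$val.after - diffs$val.before
diffs
```

This gives one row per group with before, after, and the change, which is also a convenient shape for a barplot of the differences.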