Comparing multiple data frames - R

I need some help with data analysis.
I have two datasets (before & after) and I want to see how big the difference is between them.
Before
11330 STAT1
2721 STAT2
52438 STAT3
6124 SUZY
After
17401 STAT1
3462 STAT2
0 STAT3
72 SUZY
I tried to group them with tapply(before$V1, before$V2, FUN = mean).
But when I try to plot the result, the x axis shows numbers instead of the group names.
How can I plot such tapply-ed data (frequency on the y axis and group name on the x axis)?
I also wanted to ask: what is the proper way in R to compare such datasets, since I want to find the difference between them?
Edited
dput(before$V1)
c(11330L, 2721L, 52438L, 6124L)
dput(before$V2)
structure(1:4, .Label = c("STAT1", "STAT2", "STAT3", "SUZY"), class = "factor")

Here are a couple of ideas.
This is what I think your data look like?
before <- data.frame(val = c(11330, 2721, 52438, 6124),
                     lab = c("STAT1", "STAT2", "STAT3", "SUZY"))
after <- data.frame(val = c(17401, 3462, 0, 72),
                    lab = c("STAT1", "STAT2", "STAT3", "SUZY"))
Combine them into a single data frame with a period variable:
combined <- rbind(data.frame(before, period = "before"),
                  data.frame(after, period = "after"))
Reformat to a matrix and plot with (base R) dotchart:
library(reshape2)
m <- acast(combined, lab ~ period, value.var = "val")
dotchart(m)
Plot with ggplot:
library(ggplot2)
qplot(lab, val, colour = period, data = combined)
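To quantify the difference between the two datasets, one simple option is to merge them on the label and subtract. A minimal sketch, assuming the before/after frames defined above:
# Merge on the group label, then take the per-group difference
diffs <- merge(before, after, by = "lab", suffixes = c(".before", ".after"))
diffs$change <- diffs$val.after - diffs$val.before
diffs
#     lab val.before val.after change
# 1 STAT1      11330     17401   6071
# 2 STAT2       2721      3462    741
# 3 STAT3      52438         0 -52438
# 4  SUZY       6124        72  -6052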


Argument is not numeric

I would like to visualize the number of people infected with COVID-19, but I cannot obtain the mortality rate per 100,000 population for each prefecture because the number of deaths is not being treated as a numeric type.
What I want to achieve
I want to compute "covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100" by making the data type of "covid19j_20200613$deaths" num.
Error message:
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019 :
  non-numeric argument to binary operator
Source code in question:
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)

covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
                     header = TRUE)

# Below is an example for June 13, 2020.
# Month and date may be changed.
covid19j_20200613 <- dplyr::filter(covid19j,
                                   year == 2020,
                                   month == 6,
                                   date == 13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0

pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header = TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
                                       by = c("CODE" = "CODE"))

# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")

# Join the boundary data to the COVID-19 data
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by = c("PREF_CODE" = "CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
  geom_sf(aes(fill = testedPositive)) +
  scale_fill_distiller(palette = "RdYlGn") +
  theme_bw() +
  labs(title = "Tested Positive of Covid19 (2020/06/13)")

# Mortality rate per 100,000 population
# Population is given in units of 1,000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Data files:
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What I tried
I put "as.numeric(covid19j_20200613$deaths)" before the calculation to set the number of deaths to type num, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
Translated with www.DeepL.com/Translator (free version)
as.numeric() does not permanently change the data type; it only converts temporarily, returning a new vector.
So when you run as.numeric(covid19j_20200613$deaths), this shows you the column deaths as numeric, but the column itself will stay character.
So if you want to coerce the data type, you need to also reassign:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
  mutate(
    deaths = as.numeric(deaths),
    POP2019 = as.numeric(POP2019),
    deaths_rate = deaths / POP2019 * 100
  )
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.
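A minimal demonstration of why that matters (hypothetical values, not the real CSV): blanks in a character column silently become NA when you coerce, so it is worth counting them first:
# Hypothetical mini-example of a character column with blanks, like covid19j$deaths
deaths_chr <- c("91", "1", "", "0")
as.numeric(deaths_chr)
# [1] 91  1 NA  0
# Warning message: NAs introduced by coercion
sum(deaths_chr == "")  # how many blanks before coercing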

Backtesting in R for time series

I am new to the backtesting methodology (an algorithm for assessing whether something works based on historical data). Since I am new to it, I am trying to keep things simple in order to understand it. So far I have understood that, let's say, I have a time series data set:
date <- seq(as.Date("2000/1/1"), as.Date("2001/1/31"), by = "day")
n <- length(date); n
class(date)
y <- rnorm(n)
data <- data.frame(date, y)
I will keep the first 365 days as the in-sample period in order to do something with them, and then I will update them with one observation at a time for the next month. Am I correct here?
So, if I am correct, I define the in-sample and out-of-sample periods:
T <- dim(data)[1]; T   # total number of observations, i.e. nrow(data)
outofsampleperiod <- 31
initialsample <- T - outofsampleperiod
I want, for example, to find the empirical quantile at alpha = 0.01:
pre <- data[1:initialsample, ]
ypre <- pre$y
quantile(ypre, 0.01)
      1%
-2.50478
Now the difficult part for me is updating this in a for loop in R.
I want to add one observation at a time and find the empirical quantile at alpha = 0.01 again, print them all, and check whether each is greater than the in-sample quantile obtained previously.
for (i in 1:outofsampleperiod) {
  # note: this takes the quantile of the indices 1:(initialsample+i-1), not of the y values
  qnew <- quantile(1:(initialsample + i - 1), 0.01)
  print(qnew)
}
You can create a little function that gets the 1% quantile of column y over rows 1 to i of a frame df, like this:
func <- function(i, df) quantile(df[1:i, "y"], .01)
Then apply this function to each row index of data (sapply returns a numeric vector; lapply would store a list column):
data$qnew <- sapply(1:nrow(data), func, df = data)
Output (last six rows)
> tail(data)
date y qnew
392 2001-01-26 1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29 1.0881961 -2.252701
396 2001-01-30 0.1754646 -2.252383
397 2001-01-31 0.5929567 -2.252065
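To also run the comparison described in the question, here is a minimal sketch of the expanding-window loop, assuming the data, initialsample, and outofsampleperiod objects defined above:
# Recompute the 1% quantile as each new observation arrives and
# compare it to the in-sample quantile
q_insample <- quantile(data$y[1:initialsample], 0.01)
for (i in 1:outofsampleperiod) {
  qnew <- quantile(data$y[1:(initialsample + i)], 0.01)
  cat(sprintf("step %2d: qnew = %.5f, greater than in-sample: %s\n",
              i, qnew, qnew > q_insample))
}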

Convert data on pre-post repeated measures from long to wide by filtering data to get time point as value

I have a 14K-row table of 370 liver transplant patients, with transplant date and various repeated lab tests done before and after the procedure. I want to get pre-transplant, immediate post-transplant, and 3/6/12/18/24/36-month lab results.
ID       Transp Date  Lab Units  Lab Type       Tme  Lab Val
0000001  2011-01-11              VCA IgG Index  0    6487.0
0000001  2011-01-11              VCA IgM Index  0    11230.0
0000002  2011-01-03   Copies/mL  CMV Quant PCR  3    100.0
0000002  2011-01-03   Copies/mL  EBV Quant PCR  3    683.0
I used round() on the date difference between the transplant date and the lab test date to get the month time point (Tme). My client wants the final table to have one record per patient, with all data values on that row. Headers something like this:
ID|TrnsplDate|LabType1|Units1|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType2|Units2|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType3|Units3|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType4|Units4|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36
Can anyone knowledgeable in R guide me on where to start? I use RStudio. Thanks in advance.
Try this, which will put everything for the same ID on one line; then you can adjust column names and order as needed using colnames(df) and indexing (i.e., something like colorder <- c(2, 3, 5, 1, 7, 12, ...); df[, colorder]).
### Set up data
library(lubridate)
df <- data.frame(ID = rep(sprintf("SID%s", 1:2), 2),
                 transdate = seq(mdy("01/01/2000"), mdy("01/4/2000"), 1),
                 labunits = c(NA, NA, rep("Copies/mL", 2)),
                 labtype = c(rep("VCA IgG Index", 2), "CMV Quant PCR", "EBV Quant PCR"),
                 time = c(0, 0, 2, 2),
                 labval = sample(100:2000, 4))
# Transform
df2 <- tidyr::pivot_wider(df, names_from = labtype, values_from = -ID)
# ----------------------------
# Edit: separate by lab type
df_bylab <- split(df, df$labtype)
# output each lab type to CSV
for (i in seq_along(df_bylab)) {
  write.csv(df_bylab[[i]], paste0(names(df_bylab)[i], ".csv"))
}
Based on the limited data provided, I am not sure if some columns could be collapsed (for instance, it seems like you only have one date per SID, so all the date columns could be collapsed to one column).
I would also like to point out this is not an ideal structure for the data, so perhaps your best bet is to try to convince your client otherwise!
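If the client insists on the Val0/Val3/... layout, a hedged sketch (using the toy df above; treating labtype and time as the spreading columns is an assumption about the real data) is to widen on both lab type and time point:
library(tidyr)
# One row per patient, one column per labtype/time combination,
# e.g. labval_CMV Quant PCR_2
df_wide <- pivot_wider(df,
                       id_cols     = c(ID, transdate),
                       names_from  = c(labtype, time),
                       values_from = c(labunits, labval))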

Dot Plots with multiple categories - R

I'm definitely a neophyte to R for visualizing data, so bear with me.
I'm looking to create side-by-side dot plots of seven categorical samples, with many gene expression values corresponding to individual gene names. My mydata.csv file looks like the following:
B27 B28 B30 B31 LTNP5.IFN.1 LTNP5.IFN.2 LTNP5.IL2.1
1 13800.91 13800.91 13800.91 13800.91 13800.91 13800.91 13800.91
2 6552.52 5488.25 3611.63 6552.52 6552.52 6552.52 6552.52
3 3381.70 1533.46 1917.30 2005.85 3611.63 4267.62 5488.25
4 2985.37 1188.62 1051.96 1362.32 2717.68 2985.37 5016.01
5 1917.30 2862.19 2625.29 2493.26 2428.45 2717.68 4583.02
6 990.69 777.97 1269.05 1017.26 5488.25 5488.25 4267.62
I would like each sample's data to be organized in its own dot plot within one graph. Additionally, if I could point out individual data points of interest, that would be great.
Thanks!
You can use base R, but you need to convert to a matrix first.
dotchart(as.matrix(df))
or, we can transpose the matrix to arrange it by sample:
dotchart(t(as.matrix(df)))
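If the values live in mydata.csv laid out as shown, a hedged way to load them first (assuming the first row of the file holds the sample names and there is no row-name column):
df <- read.csv("mydata.csv", header = TRUE)
dotchart(as.matrix(df))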
Considering your [toy] data is stored in a data frame called a:
library(reshape2)
library(ggplot2)
a$trial <- 1:nrow(a)  # a row index to serve as the x variable
b <- melt(data = a, id.vars = "trial")  # wide to long: sample columns become 'variable'
b$variable <- as.factor(b$variable)
ggplot(b, aes(trial, value)) + geom_point() + facet_wrap(~variable)
produces a faceted plot: one panel of value vs. trial per sample.
What we did:
Loaded the required libraries (reshape2 to convert from wide to long, and ggplot2 to, well, plot); melted the data into long format (more difficult to read, easier to process); and then plotted with ggplot.
I introduced trial to mark each "run" in which each variable was measured, and so I plotted trial vs value at each level of variable. The facet_wrap part puts each plot into a subplot region determined by variable.
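To point out individual data points of interest, one approach is to overlay a second geom_point on just the rows you care about. A sketch assuming the long-format b from above; genes_of_interest is a hypothetical vector of gene rows:
genes_of_interest <- c(1, 6)  # hypothetical rows (genes) to highlight
ggplot(b, aes(trial, value)) +
  geom_point() +
  geom_point(data = subset(b, trial %in% genes_of_interest),
             colour = "red", size = 3) +
  facet_wrap(~variable)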

Using ggplot2, connect x- and y-coordinates by a third variable

I would like to plot latitude vs longitude and connect the points via date and time, which I have stored in an object of class POSIXlt. I have many, many GPS points, but here is a small set of them that I would like to plot using ggplot2.
My data are like so:
Description lat lon
6/16/2012 17:22 12.117017 -89.69692
6/17/2012 9:15 12.1178 -89.69675
6/17/2012 9:33 12.117783 -89.69673
6/17/2012 10:19 12.11785 -89.69665
6/17/2012 10:45 12.11775 -89.69677
6/17/2012 11:22 12.1178 -89.69673
6/17/2012 11:39 12.117817 -89.69662
6/17/2012 11:59 12.117717 -89.69677
6/17/2012 12:10 12.117717 -89.69655
6/16/2012 16:38 12.11795 -89.6965
6/16/2012 18:29 12.1178 -89.69688
6/16/2012 17:11 12.117417 -89.69703
6/16/2012 17:36 12.116967 -89.69668
6/16/2012 17:50 12.117217 -89.69695
6/16/2012 18:02 12.117583 -89.69715
6/16/2012 18:15 12.11785 -89.69665
6/16/2012 18:27 12.117683 -89.69632
I have a map that I am plotting these points onto.
I can plot the points just fine
plot1 <- map + geom_point(data=dat, aes(x = lon, y = lat))
map is an object I made with ggmap, but it's not that important to include here.
The following code produces a line connecting points as lon increases
plot1+geom_line(data=dat, aes(x=lon,y=lat,colour="red"))
I can't figure out how to connect the points by the POSIXlt vector Description.
I know that in this small example I could easily reorder the points using something like dat2 <- dat[with(dat, order(Description)), ], remake plot1 using dat2, and make the desired plot using the following code:
plot1 + geom_path(data = dat2, aes(x = lon, y = lat, colour = "red"))
But for my much larger dataset (hundreds of thousands of observations), this doesn't make sense as a solution without a bit more work to properly ID each observation, which I will certainly end up doing anyway as part of additional data exploration.
Is there an argument I haven't discovered in geom_line for telling R how to connect the points?
I am admittedly still a novice at using ggplot2, so I apologize if I have missed something very simple. I have been working on a lot of other code and learning, or at least using, several other packages to work with this GPS data and other available spatial data. It's all a bit overwhelming... So many ideas, so little know-how! The larger point of this is to visualize (and eventually analyze) movement patterns and use of space by my study organisms, but for now it would be great to visualize the data in a variety of ways to really get familiar with it.
If you have any recommended packages for working with spatial data and GPS data, I'd love to hear about them, as well.
You need the rows ordered by the date/time object to use geom_path. Since I think this is the best way to display the data, we should focus on finding an efficient way to sort a large dataset. Obviously it would be good to get an idea of the scale of the dataset you are working with. Millions of rows? Billions perhaps?!
Fortunately the data.table package does this very well indeed. Here is an example on a 1-million-row table, with an ID column the table is originally sorted on, an unsorted time column of 1-second observations, and two random columns for x and y; sorting by date/time takes < 1 s on my laptop:
set.seed(123)
require(data.table)

# Rows ordered on ID; random order of unique date/time values at 1-second intervals
df <- data.frame(ID = seq.int(1e6),
                 Desc = as.POSIXct(sample(1e6), origin = Sys.Date()),
                 x = runif(1e6),
                 y = runif(1e6))
head(df)
# ID                Desc         x         y
#1  1 2013-05-25 02:39:39 0.2363783 0.1387404
#2  2 2013-05-25 23:58:17 0.1192702 0.1284918
#3  3 2013-05-21 17:41:57 0.8599183 0.6301114
#4  4 2013-05-23 16:12:42 0.8089243 0.7919304
#5  5 2013-05-21 08:17:28 0.8197109 0.4568693
#6  6 2013-05-22 17:57:23 0.4611204 0.5358536

# Convert to data.table
DT <- data.table(df)

# Sort on 'Desc'
setkey(DT, Desc)
head(DT)
#     ID                Desc         x          y
#1: 544945 2013-05-18 01:00:01 0.7052422 0.52030877
#2: 886165 2013-05-18 01:00:02 0.2256636 0.04391553
#3: 893690 2013-05-18 01:00:03 0.1860687 0.30978506
#4: 932276 2013-05-18 01:00:04 0.6305562 0.65188810
#5: 407622 2013-05-18 01:00:05 0.5355992 0.98146120
#6: 138936 2013-05-18 01:00:06 0.5999025 0.81722902

# Back to a data.frame for ggplot2 (you can likely use the data.table
# directly, since data.table inherits from data.frame)
df2 <- as.data.frame(DT)
So in your case you can try something like:
datDT <- data.table(dat)
setkey(datDT, Description)
dat2 <- as.data.frame(datDT)
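From there, the geom_path call from the question works on the sorted frame. A minimal sketch, assuming the map object and dat2 from above (note colour is set outside aes() so the path is literally red rather than mapped to a constant):
map +
  geom_point(data = dat2, aes(x = lon, y = lat)) +
  geom_path(data = dat2, aes(x = lon, y = lat), colour = "red")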
