Transferring categorical means to a new table - r

I'm fairly new to R, but I've tackled much larger challenges than my current problem, which makes it particularly frustrating. I searched the forums and found some related topics, but none would do the trick for this situation.
I've got a dataset with 184 observations of 14 variables:
> head(diving)
   tagID ddmmyy Hour.GMT. Hour.Local.  X0  X3 X10  X20  X50 X100 X150 X200 X300 X400
1 122097 250912         0           9 0.0 0.0 0.3 12.0 15.3 59.6 12.8  0.0    0    0
2 122097 260912         0           9 0.0 2.4 6.9  5.5 13.7 66.5  5.0  0.0    0    0
3 122097 260912         6          15 0.0 1.9 3.6  4.1 12.7 39.3 34.6  3.8    0    0
4 122097 260912        12          21 0.0 0.2 5.5  8.0 18.1 61.4  6.7  0.0    0    0
5 122097 280912         6          15 2.4 9.3 6.0  3.4  7.6 21.1 50.3  0.0    0    0
6 122097 290912        18           3 0.0 0.2 1.6  6.4 41.4 50.4  0.0  0.0    0    0
This is tagging data, with each date having one or more 6-hour time bins (not a continuous dataset due to transmission interruptions). In each 6-hour bin, the depths to which the animal dived are broken down, by %, into 10 bins. So X0 = % of time spent between 0-3m, X3= % of time spent between 3-10m, and so on.
What I want to do for starters is take the mean % time spent in each depth bin and plot it. To start, I did the following:
avg0<-mean(diving$X0)
avg3<-mean(diving$X3)
avg10<-mean(diving$X10)
avg20<-mean(diving$X20)
avg50<-mean(diving$X50)
avg100<-mean(diving$X100)
avg150<-mean(diving$X150)
avg200<-mean(diving$X200)
avg300<-mean(diving$X300)
avg400<-mean(diving$X400)
At this point, I wasn't sure how to then plot the resulting means, so I made them a list:
divingmeans<-list(avg0, avg3, avg10, avg20, avg50, avg100, avg150, avg200, avg300, avg400)
boxplot(divingmeans) sort of works, providing 1:10 on the X axis and the % 0-30 on the y axis. However, I would prefer a histogram, as well as the x-axis providing categorical bin names (e.g. avg3 or X3), rather than just a rank 1:10.
hist() and plot() provide the following:
> plot(divingmeans)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
> hist(divingmeans)
Error in hist.default(divingmeans) : 'x' must be numeric
I've also tried:
> df<-as.data.frame(divingmeans)
> df
X3.33097826086957 X3.29945652173913 X8.85760869565217 X17.6461956521739 X30.2614130434783
1 3.330978 3.299457 8.857609 17.6462 30.26141
X29.3565217391304 X6.44510869565217 X0.664130434782609 X0.135869565217391 X0.0016304347826087
1 29.35652 6.445109 0.6641304 0.1358696 0.001630435
and
> df <- data.frame(matrix(unlist(divingmeans), nrow=10, byrow=T))
> df
matrix.unlist.divingmeans...nrow...10..byrow...T.
1 3.330978261
2 3.299456522
3 8.857608696
4 17.646195652
5 30.261413043
6 29.356521739
7 6.445108696
8 0.664130435
9 0.135869565
10 0.001630435
neither of which provide the sort of table I'm looking for.
I know there must be a really basic solution for converting this into an appropriate table, but I can't figure it out for the life of me. I'd like to be able to make a basic histogram showing the % of time spent in each diving bin, on average. It seems the best format for the data to be in for this purpose would be a table with two columns: col1=bin (category; e.g. avg50), and col2=% (numeric; mean % time spent in that category).
You'll also notice that the data is broken up into different timing bins; eventually I'd like to be able to separate out the data by time of day, to see if, for example, the average diving depths shift between day/night, and so on. I figure that once I have this initial bit of code worked out, I can then do the same by time-of-day by selecting, for example X0[which(Hour.GMT.=="6")]. Tips on this would also be very welcome.

I think you will find it far easier to deal with the data in long format.
You can reshape it with reshape(). I will use data.table to show how to easily calculate the means by group.
library(data.table)
DT <- data.table(diving)
# columns 5:14 are the depth bins; 'times' holds the lower bound of each bin in metres
DTlong <- reshape(DT, varying = list(5:14), direction = 'long',
times = c(0,3,10,20,50,100,150,200,300,400),
v.names = 'time.spent', timevar = 'depth')
meanByDepth <- DTlong[,list(mean.time = mean(time.spent)),by=depth]
# you can then plot the two-column data.table
plot(meanByDepth, type = 'l')
You can now analyse by any combination of date / hour / depth.

How would you like to plot them?
# grab the means of each depth column (the first four columns are tagID, date and the two hour columns)
diving.means <- colMeans(diving[, -(1:4)])
# plot it
plot(diving.means)
# boxplot
boxplot(diving.means)
If you'd like to grab the lower bound of each interval from the column names, simply strip away the X:
lowerIntervalBound <- gsub("X", "", names(diving)[-(1:4)])
# you can convert these to numeric and plot against them
lowInts <- as.numeric(lowerIntervalBound)
plot(x=lowInts, y=diving.means)
# ... or taking log
plot(x=log(lowInts), y=diving.means)
# ... or as factors (similar to basic plot)
plot(x=factor(lowInts), y=diving.means)
Instead of putting the diving means in a list, try putting them in a vector (using c()).
If you want to combine it into a data.frame:
data.frame(lowInts, diving.means)
# or adding a row id if needed.
data.frame(rowid=seq(along=diving.means), lowInts, diving.means)
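The two-column table the question asks for (bin name, mean %) could be sketched as below. This is a minimal sketch under stated assumptions, not the answerers' code: it rebuilds only the six rows shown in head(diving), and the names depth.cols, means.table and by.hour are made up for illustration. barplot() is used because one bar per depth bin is a bar chart rather than a histogram.

```r
# Only the six rows printed by head(diving) in the question (assumption: toy subset)
diving <- data.frame(
  tagID = 122097,
  ddmmyy = c(250912, 260912, 260912, 260912, 280912, 290912),
  Hour.GMT. = c(0, 0, 6, 12, 6, 18),
  Hour.Local. = c(9, 9, 15, 21, 15, 3),
  X0 = c(0, 0, 0, 0, 2.4, 0),
  X3 = c(0, 2.4, 1.9, 0.2, 9.3, 0.2),
  X10 = c(0.3, 6.9, 3.6, 5.5, 6.0, 1.6),
  X20 = c(12.0, 5.5, 4.1, 8.0, 3.4, 6.4),
  X50 = c(15.3, 13.7, 12.7, 18.1, 7.6, 41.4),
  X100 = c(59.6, 66.5, 39.3, 61.4, 21.1, 50.4),
  X150 = c(12.8, 5.0, 34.6, 6.7, 50.3, 0),
  X200 = c(0, 0, 3.8, 0, 0, 0),
  X300 = 0,
  X400 = 0
)

# Pick the depth-bin columns by name rather than by position
depth.cols <- grep("^X[0-9]+$", names(diving), value = TRUE)

# Two-column table: bin name (kept in depth order) and mean % of time
means.table <- data.frame(
  bin = factor(depth.cols, levels = depth.cols),
  mean.pct = unname(colMeans(diving[, depth.cols]))
)

# One bar per depth bin, labelled with the bin name
barplot(means.table$mean.pct, names.arg = as.character(means.table$bin),
        xlab = "Depth bin", ylab = "Mean % of time")

# Same means, split by the 6-hour GMT bin (one row per hour value)
by.hour <- aggregate(diving[, depth.cols],
                     by = list(Hour.GMT. = diving$Hour.GMT.), FUN = mean)
```

Selecting the columns by name avoids the off-by-one risk of dropping columns by position, and aggregate() gives the time-of-day breakdown without subsetting each bin by hand.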

Related

Dataframe operation within and between dataframes

How can I make some operation within and between dataframes in R?
For example, here is a dataframe on stock returns.
stocks <- data.frame(
time=as.Date('2009-01-01') + 0:9,
X=rnorm(10, 0, 1),
Y=rnorm(10, 0, 2),
Z=rnorm(10, 0, 4)
)
time X Y Z
1 2009-01-01 -0.31758501 -1.2718424 -2.9979292
2 2009-01-02 -1.06440187 0.4202969 -5.7925412
3 2009-01-03 0.26475736 -2.3955779 -2.2638179
4 2009-01-04 -0.83653746 0.4161053 -10.1011995
5 2009-01-05 -0.12214392 0.7143456 3.6851497
6 2009-01-06 -0.01186287 -2.1322029 -0.1577852
7 2009-01-07 0.27729415 0.1323237 -4.4237673
8 2009-01-08 -1.74389562 0.4962045 0.4192498
9 2009-01-09 0.83150240 -0.9241747 -1.6752324
10 2009-01-10 -0.52863956 0.1044531 -1.2083588
Q1) I'd like to create a dataframe with previous day.
For example, final result that I want would be expressed lag(stocks,1)
What is the most simple and elegant way to achieve this?
Is there any simple way to use dplyr?
Q2) How can I apply any basic arithmetic operation to this dataframe?
for example, I'd like to create dataframes with,
stocks1 = stocks + 1
stocks2 = stocks x 3
stocks3 = stocks2 / stocks1 (operation between two dataframes)
stocks4 = stocks3 / lag(stocks1)
Something like this.
What would be the most simple and elegant way?
To address the first problem, this might help. You don't necessarily need dplyr in this instance; the head() function is sufficient if all you wish to do is lag the variables.
stocks <- data.frame(
time=as.Date('2009-01-01') + 0:9,
X=rnorm(10, 0, 1),
Y=rnorm(10, 0, 2),
Z=rnorm(10, 0, 4)
)
previous<-head(stocks,9)
df<-data.frame(stocks$time[2:10],stocks$X[2:10],stocks$Y[2:10],stocks$Z[2:10],previous$X,previous$Y,previous$Z)
col_headings<-c("time","X","Y","Z","previousX","previousY","previousZ")
names(df)<-col_headings
Here, the dates from 2nd January to 10th January are displayed, with the lags for X, Y, and Z also being included in the data frame.
> df
time X Y Z previousX previousY
1 2009-01-02 0.7878110 -2.1394047 0.68775794 -0.0759606 1.2863089
2 2009-01-03 -0.2767296 -2.3453356 -1.56313888 0.7878110 -2.1394047
3 2009-01-04 -0.2122021 0.1589629 -1.13926020 -0.2767296 -2.3453356
4 2009-01-05 0.1195826 3.2320352 -0.32020803 -0.2122021 0.1589629
5 2009-01-06 0.7642622 -0.7621168 1.66614679 0.1195826 3.2320352
6 2009-01-07 -0.3073972 -2.9475654 5.63945611 0.7642622 -0.7621168
7 2009-01-08 0.3597369 0.5011861 5.95424269 -0.3073972 -2.9475654
8 2009-01-09 -1.8701881 0.4417496 1.34273218 0.3597369 0.5011861
9 2009-01-10 -1.1172033 -0.5566736 0.05432339 -1.8701881 0.4417496
previousZ
1 3.2188050
2 0.6877579
3 -1.5631389
4 -1.1392602
5 -0.3202080
6 1.6661468
7 5.6394561
8 5.9542427
9 1.3427322
As regards calculations, it depends on what you are trying to do.
e.g. do you want to add 1 to each row in Z?
> df$Z+1
[1] 1.6877579 -0.5631389 -0.1392602 0.6797920 2.6661468 6.6394561
[7] 6.9542427 2.3427322 1.0543234
You could divide two stock returns by each other as you've specified as well. Note that we have combined them in the one dataframe, so we are not necessarily conducting an "operation between two dataframes" per se.
> df$Y/df$Z
[1] -3.11069421 1.50040132 -0.13953168 -10.09354826 -0.45741275
[6] -0.52266839 0.08417294 0.32899307 -10.24740160
By specifying the dataframe (in this case, df), along with the associated variable (as indicated after the $ symbol), then you should be able to carry out a wide range of calculations across the dataframe.
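For the whole-data-frame arithmetic in Q2, one possible sketch in base R is shown here. It is an illustration, not the answer's method: the helper lag1 and the names returns, stocks1.lag and result are hypothetical, and the arithmetic is restricted to the numeric columns so the time column is left alone. dplyr::lag() performs the same one-step shift on a single vector if you prefer dplyr.

```r
set.seed(42)  # only so the example is reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Work on the numeric columns only
returns <- stocks[, c("X", "Y", "Z")]

# Arithmetic operators apply element-wise to data frames of the same shape
stocks1 <- returns + 1
stocks2 <- returns * 3
stocks3 <- stocks2 / stocks1

# Shift a vector down by one position, padding the first value with NA
lag1 <- function(col) c(NA, head(col, -1))
stocks1.lag <- as.data.frame(lapply(stocks1, lag1))

stocks4 <- stocks3 / stocks1.lag

# Re-attach the dates; the first row is NA because it has no previous day
result <- cbind(time = stocks$time, stocks4)
```

Unlike the head() approach, this keeps all ten rows and marks the missing first lag as NA instead of dropping the row.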

R: Is it possible to combine rows of non-equal length into a single data frame using a for-loop?

I have been working with a dataset (called CWNA_clim_vars) structured so that the variables associated with each datapoint within the set are arranged in columns, like this:
dbsid elevation Tmax04 Tmax10 Tmin04 Tmin10 PPT04 PPT10
0001 1197 8.1 8.9 -5.2 -3.5 34 95
0002 1110 7.7 8 -2.9 -0.6 114 375
0003 1466 5.4 6.4 -4.7 -1.5 199 453
0004 1267 6.1 7.1 -3.6 -0.7 166 376
... ... ... ... ... ... ... ...
1000 926 7.2 10.1 -0.8 2.7 245 351
I've been attempting to run boxplot.stats() on each column, retrieve the values of the outliers within each column, and write them to a new data frame called summary_stats. The code I set up to achieve this is as follows:
summary_stats <- data.frame()
for (i in names(CWNA_clim_vars)){
temp <- boxplot.stats(CWNA_clim_vars[,i])
out <- as.list(temp$out)
for (j in out) {
summary_stats[i,j] <- out[j]
}
}
Unfortunately, in running this, the following error message is thrown:
Error in `[<-.data.frame`(`*tmp*`, i, j, value = list(6.65)) :
new columns would leave holes after existing columns
I am guessing the error is thrown because the number of outliers varies between columns: if I replace temp$out with temp$n, which contains only one number per column, the result is a data frame with these numbers arranged in a single column.
Is there a way of easily remedying this so that I end up with a data frame having rows which are not necessarily of the same length? Thanks for considering my question - any help I would appreciate greatly.
You'd better use a "list".
out_lst <- lapply(CWNA_clim_vars, function (x) boxplot.stats(x)$out)
If for some reason you have to present it in a "data frame", you need padding.
N <- max(lengths(out_lst))
out_df <- data.frame(lapply(out_lst, function (x) c(x, rep(NA, N - length(x)))))
Try with a tiny example:
CWNA_clim_vars <- data.frame(a = c(rep(1,9), 10), b = c(10,11,rep(1,8)))
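Putting the pieces above together on the tiny example gives a runnable sketch. The long-format variant with stack() at the end is an addition of mine, not part of the answer; it is one way to avoid NA padding, and the name out_long is made up.

```r
# Tiny stand-in for the real data: column a has one outlier, column b has two
CWNA_clim_vars <- data.frame(a = c(rep(1, 9), 10), b = c(10, 11, rep(1, 8)))

# One vector of outliers per column; lengths may differ, so a list fits naturally
out_lst <- lapply(CWNA_clim_vars, function(x) boxplot.stats(x)$out)

# Pad every vector with NA up to the longest one to force a rectangular data frame
N <- max(lengths(out_lst))
out_df <- data.frame(lapply(out_lst, function(x) c(x, rep(NA, N - length(x)))))

# Alternative without padding: one row per outlier, with the source column in 'ind'
out_long <- stack(out_lst)

print(out_lst)
print(out_df)
print(out_long)
```

Here column a contributes the single outlier 10 and column b contributes 10 and 11, so out_df has two rows with an NA in the second row of a.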

Loop that matches row to column names and computes an average of the 3 preceding columns

I'm trying to make some computations in R. I have a dataset whose columns are id, start date, and then one column for every daily date from 2014 till 2017.
Every id has a different start date. Each date column holds the concentration of a chemical specific to an individual id.
A sample from my data looks like this:
id time 20140101 20140102 20140103 20140104 20140105 20140106 20140107
1 1 20141119 2.6 2.5 4.1 4.8 3.1 1.8 3.5
2 4 20150403 1.7 1.6 2.8 3.4 2.0 1.2 1.9
3 7 20140104 2.2 2.2 3.7 4.4 2.6 1.3 2.9
4 8 20141027 2.7 2.5 4.1 4.9 3.3 1.8 3.6
5 9 20141112 2.6 2.4 3.9 4.7 3.1 1.7 3.4
What I would like to do is run a script that loops through each row's id and time combo, e.g. "1 20141119" or "8 20141027", matches the date number to the column names, and gives me the corresponding concentration value.
So the combo "7 20140104" gives me the concentration 4.4.
After this I would like to do the same, but take a 3-day average ending on the time date. So for the combo "7 20140104", average the concentrations for the dates 20140102, 20140103 and 20140104 for id 7.
I made a small test data frame
id <- 12:18
date <- c("c","d","e","f","c","d","e")
a <- rnorm(7, 2, 1)
b <- rnorm(7, 2, 1)
c <- rnorm(7, 2, 1)
d <- rnorm(7, 2, 1)
e <- rnorm(7, 2, 1)
f <- rnorm(7, 2, 1)
df <- data.frame(id, date, a, b, c, d, e, f)
This was my solution for the first part of my question.
for(i in 1:nrow(df)){
conc <- df[i, df[i,"date"]==colnames(df)]
print(conc)
}
which works well enough for the first part, but currently I don't know how to do the 3-day average. If you have tips on how to do the first part more nicely, I'm all ears.
Hopefully you people can help me.
Thanks very much for your help.
If I've understood the question correctly, given a value, you want to get the next two values in that row and return the mean of the 3 values.
Assuming that these date columns are in order, I've adapted your loop to include what I think you are after. Not the most elegant code, but I've tried to lay it out in a step-by-step manner:
# shown for the first row only; for later rows conPos + 2 must not run past the last column
for (i in 1:1) {
conc <- df[i, df[i,"date"]==colnames(df)]
conPos <- which(df[i,"date"]==colnames(df)) # Get the position
av <- df[i, (conPos:(conPos+2))] # Get the next two columns' values
print(rowMeans(av)) # Get the average
}
A potentially more efficient way to do this (depending on the size of your dataset) is to use an apply function instead of a for loop. Something such as:
apply (df, MARGIN = 1, FUN = function(x, i){
position <- (which(x[['date']] == colnames(df)))
threeDayAverage <- as.numeric((x[(position:(position+2))]))
print(sum(threeDayAverage) / 3)
})
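The question itself asks for the average of the matched date and the two days before it, which differs from the "next two columns" reading used above. A minimal sketch of that preceding-window version follows. It assumes the date columns are in chronological order and that every matched date has at least two date columns before it; the names conc, pos, value.on.date and avg3 are made up for illustration.

```r
set.seed(1)  # only so the example is reproducible
df <- data.frame(
  id = 12:18,
  date = c("c", "d", "e", "f", "c", "d", "e"),
  a = rnorm(7, 2, 1), b = rnorm(7, 2, 1), c = rnorm(7, 2, 1),
  d = rnorm(7, 2, 1), e = rnorm(7, 2, 1), f = rnorm(7, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the concentration columns as a numeric matrix
conc <- as.matrix(df[, -(1:2)])

# Column position of each row's date within the concentration matrix
pos <- match(df$date, colnames(conc))

# Part 1: the concentration on the matched date, one value per row
value.on.date <- conc[cbind(seq_len(nrow(conc)), pos)]

# Part 2: mean of the matched date and the two columns before it
avg3 <- sapply(seq_len(nrow(conc)), function(i) mean(conc[i, (pos[i] - 2):pos[i]]))

result <- data.frame(id = df$id, date = df$date, value.on.date, avg3)
```

Indexing a matrix with a two-column (row, column) matrix picks one cell per row, which replaces the explicit loop for the first part.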

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want and is leading to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
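A quick check of this version on a made-up score vector (the values in scores are hypothetical, not taken from the speeches) shows the shape of the result. One thing to watch, as far as I can tell: data.frame(table(x)) stores Score as a factor, so it may need converting back to numbers before it is used on a numeric axis with a fixed range.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  return(tbl)
}

# Hypothetical sentence scores, for illustration only
scores <- c(-1, 0, 0, 1, 1, 1, 2, 0)
pct <- scoreFun(scores)

# Score comes back as a factor; convert via character to recover the numbers
pct$Score <- as.numeric(as.character(pct$Score))

print(pct)
```

The result is an ordinary data frame with named columns Score, Count and Percentage, which is the structure the question was aiming for.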

Blank information after merge

R version 3.2.1
I'm following this article on how to create Heat Maps
and downloaded the required files (just need help with the logic)
This works
setwd("D:/GIS/london")
library(maptools)
library(ggplot2)
library(gpclib)
sport <- readShapeLines("london_sport.shp")
When I run
names(sport)
I get normal output, i.e.
[1] "ons_label" "name" "Partic_Per" "Pop_2001"
And when I print sport
print(sport)
I get this (showing first 6 lines)
geometry ons_label name Partic_Per Pop_2001
0 MULTILINESTRING((541177.7 173555.7 ...)) 00AF Bromley 21.7 295535
1 MULTILINESTRING((522957.6 178071.3 ...)) 00BD Richmond upon Thames 26.6 172330
2 MULTILINESTRING((505114.9 184625.1 ...)) 00AS Hillingdon 21.5 243006
3 MULTILINESTRING((552108.3 194151.8 ...)) 00AR Havering 17.9 224262
4 MULTILINESTRING((519370.8 163657.4 ...)) 00AX Kingston upon Thames 24.4 147271
5 MULTILINESTRING((525554.3 166815.8 ...)) 00BF Sutton 19.3 179767
6 MULTILINESTRING((513062.8 178187.2 ...)) 00AT Hounslow 16.9 212352
So far everything looks normal. I understand the next few lines, to create plot points
p <- ggplot(sport@data, aes(Partic_Per, Pop_2001))
p + geom_point(aes(color="Partic_Per", size = "Pop_2001")) + geom_text(size=2, aes(label=name))
gpclibPermit()
Then we make the shape file (sport) into a data frame
sport_geom <- fortify(sport, region="ons_label")
And when I execute head(sport_geom) I get
long lat order piece group id
1 541177.7 173555.7 1 1 0.1 0
2 541872.2 173305.8 2 1 0.1 0
3 543441.5 171429.9 3 1 0.1 0
4 544361.6 172379.2 4 1 0.1 0
5 546662.4 170451.9 5 1 0.1 0
6 548187.1 170582.3 6 1 0.1 0
Then the command to merge data, where the problem arises
sport_geom <- merge (sport_geom, sport@data, by.x = "id", by.y = "ons_label")
When I go according to the document and print by., the only values that pop up are by.data.frame and by.default
And when I execute head(sport_geom), the data is blank!!!
[1] id long lat order piece group name Partic_Per Pop_2001
<0 rows> (or 0-length row.names)
What am I missing and how to troubleshoot this?
Perhaps this is an error in the document (this is the simplest tutorial I can find on Heat Maps), because I was troubleshooting another error for the past hour.
Please help!!!!!! I know it's long, but it's the best I could do to explain the issue and possibly reproduce it.
If you need links to download the stuff, I'll get it for you.
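One way to troubleshoot an empty merge is to compare the key values on both sides. In the output above, sport_geom$id holds row ids such as 0 while ons_label holds codes such as 00AF, so no keys match. The sketch below reproduces that situation with toy data frames (geom and attrs are hypothetical stand-ins, not the real shapefile objects), and the re-keying step assumes the ids are the zero-based row names of the attribute table, as the printed output suggests.

```r
# Toy stand-in for the fortified geometry: id holds row ids, not ONS codes
geom <- data.frame(
  id = c("0", "0", "1"),
  long = c(541177.7, 541872.2, 522957.6),
  lat = c(173555.7, 173305.8, 178071.3),
  stringsAsFactors = FALSE
)

# Toy stand-in for the attribute table
attrs <- data.frame(
  ons_label = c("00AF", "00BD"),
  name = c("Bromley", "Richmond upon Thames"),
  stringsAsFactors = FALSE
)

# Diagnosis: the two key columns share no values, so merge() returns 0 rows
shared.keys <- intersect(geom$id, attrs$ons_label)
empty <- merge(geom, attrs, by.x = "id", by.y = "ons_label")

# Possible fix (assumption: id is the zero-based row position of each feature)
attrs$id <- as.character(seq_len(nrow(attrs)) - 1)
merged <- merge(geom, attrs, by = "id")
```

If intersect() comes back empty, the join columns are the problem rather than merge() itself.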

Resources