Is there a way to track change in response over time and map that onto changes in response of another category in R?

(I'm SUPER new to coding in general so all suggestions are much appreciated.)
So I'm working with a data set that contains panel survey data, posed to the same 8000 participants 7 times over the course of the last decade. I currently have dummy-variable forms of the answers I'm interested in, so now my data looks like this:
colour2011  colour2016  colour2018
1           1           0
0           0           0
0           1           1
1           0           0
1           1           1
and the other variable's data looks similar with column names being tied to the year the question was asked. Is there a way to not only show change of answer for both using ggplot2, but also track rate of change and display that visually by year?
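One possible starting point, sketched in base R with an optional ggplot2 plot: compute the share of 1s in each wave, then divide the difference between consecutive waves by the number of years between them. The data frame `panel`, its `id` column, and the names `wave_cols` and `share` are assumptions made for this illustration; the five rows are the sample values above, read row by row.

```r
# Sample rows from the question; the id column is an assumption
panel <- data.frame(
  id         = 1:5,
  colour2011 = c(1, 0, 0, 1, 1),
  colour2016 = c(1, 0, 1, 0, 1),
  colour2018 = c(0, 0, 1, 0, 1)
)

# Pick the wave columns and read the year out of each column name
wave_cols <- grep("^colour", names(panel), value = TRUE)
share <- data.frame(
  year  = as.integer(sub("^colour", "", wave_cols)),
  share = unname(colMeans(panel[wave_cols]))  # proportion answering 1
)

# Change in the proportion per year between consecutive waves
share$rate <- c(NA, diff(share$share) / diff(share$year))
print(share)

# Optional plot; skipped if ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(share, aes(x = year, y = share)) + geom_line() + geom_point())
}
```

Running the same steps on the second variable and stacking the two `share` tables, with an extra column naming the variable, would let both be drawn in one plot, for example by mapping that column to `colour` in `aes()`.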

Related

How to count number of events of length 5<=N<10 days meeting a set condition using CDO?

I need help with a CDO operation on a netCDF file. I downloaded a dataset covering 40 years from ERA5 over a grid region, and I masked variable values in a range (30-50) to 1 and other values to 0 using cdo.
cdo -expr,'var2=var*(var>=30 && var<50)' data1.nc data2.nc
Now I want to calculate the number of times each grid cell recorded var2 = 1 for at least 5 but fewer than 10 consecutive days in the last 40 years. Is that possible using cdo or nco?
First of all, I'm assuming your input has already been converted to daily values; you don't say.
You also need to clarify the question. Your title originally asked "how to count the number of days?", which was a bit ambiguous.
Let's say you have a series like this that represents an 8 day event:
0 0 1 1 1 1 1 1 1 1 0 0 0 0
Does that count as a single occurrence? Your text seemed to imply so, but the title did not. I think you wanted to know the number of events, not days, so I edited your title to agree with the main text of the question. I hope this interpretation is correct.
I think you can do it but the solution is a bit longwinded. You can use runsum to give you a "1" for any day which is 1 and is on the end of a series of N days like this:
cdo gec,N -runsum,N in.nc out5.nc
But that doesn't totally answer your question. For example if N=5 this would convert the above series into this:
0 0 0 0 0 0 1 1 1 1 0 0 0 0
i.e. the last four days of the event each sit at the end of a run of five consecutive 1s.
How can we get an upper limit on the length of an event? Well, if we do the same calculation for events of 10 days or more and add the two results together, each day gets one of three values:
0: not an event
1: an event of at least 5 days but less than 10
2: an event of at least 10 days (and of course at least 5 days)
So we just add the two series and pick out the 1s to get the range of event lengths you require:
cdo gec,10 -runsum,10 in.nc out10.nc
# only keep events of 5,6,7,8 and 9 days in length:
cdo eqc,1 -add out5.nc out10.nc out5-10.nc
Okay now we have a file where var=1 when it is at the end of a series of at least 5 but less than ten days.
Now this is the cool part: we can apply the same technique using runmean/runsum to pick up the START and END of each of these series, and then we can add up these events. If we apply a runsum with a window size of 2, this produces 1 for a sequence of "0 1" or "1 0", i.e. it picks up the start and end points of each event.
cdo eqc,1 -runsum,2 out5-10.nc out_start_end.nc
This command turns our example series into the following, since we've seen only a sequence of "0 1" or "1 0" results in a 1:
0 0 0 0 0 0 1 0 0 1 0 0 0 0
Now we just need to sum this in time and divide by 2 (I told you it was long winded!)
cdo divc,2 -timsum out_start_end.nc number_of_events.nc
ta da!
Note 1: if the whole input series ends mid-event, e.g. 0 0 1 1 1, this method will count it as a "half" event, since you only pick up the start. Round down to the nearest integer if this upsets you.
Putting this all together (and you can probably pipe to combine some of this), here is the whole solution, five command lines chaining 10 cdo operators in total:
cdo gec,5 -runsum,5 in.nc out5.nc
cdo gec,10 -runsum,10 in.nc out10.nc
cdo eqc,1 -add out5.nc out10.nc out5-10.nc
cdo eqc,1 -runsum,2 out5-10.nc out_start_end.nc
cdo divc,2 -timsum out_start_end.nc number_of_events.nc
Note 2, the runsum commands will use the window mid-point for the date/timestamp, but that is not important for this use-case. If anyone wants to also use the outN.nc files to see when the event days are, then it is usual to lag the time stamp using --timestat_date last, see this video for more details.
Note 3 If you sum the series of days within the events, you can now divide this by the number of events to get the mean event length.
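As a check on the arithmetic, the same steps can be traced on the 8-day example series in plain R. This is only a sketch: `runsum()` here is a hypothetical helper written for this illustration, not the cdo operator, and it pads the first n - 1 positions with 0 and aligns each sum to the end of its window so that the vectors keep the length of the input (cdo itself returns fewer timesteps).

```r
# Example series from the answer: a single 8-day event
x <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

# Hypothetical helper: running sum over n values, aligned to the window end.
# Positions without a full window are set to 0 (assumption for readability).
runsum <- function(v, n) {
  s <- cumsum(v)
  out <- s - c(rep(0, n), s[seq_len(length(s) - n)])
  out[seq_len(n - 1)] <- 0
  out
}

out5  <- as.integer(runsum(x, 5) >= 5)     # like: cdo gec,5 -runsum,5
out10 <- as.integer(runsum(x, 10) >= 10)   # like: cdo gec,10 -runsum,10
band  <- as.integer(out5 + out10 == 1)     # like: cdo eqc,1 -add

# A window of 2 sums to 1 only across a "0 1" or "1 0" transition
start_end <- as.integer(runsum(band, 2) == 1)
n_events  <- sum(start_end) / 2            # like: cdo divc,2 -timsum

print(out5)
print(start_end)
print(n_events)
```

For this series `out5` matches the converted series shown in the answer and the count comes out as one event. Feeding a longer run through the same steps (say, twelve 1s in a row) is a quick way to see how the method treats events of ten days or more before relying on it.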

Finding Correlations between data in dataframe (including binary)

I have a dataset called dolls.csv that I imported using
dolls <- read.csv("dolls.csv")
This is a snippet of the data
Name Review Year Strong Skinny Weak Fat Normal
Bell 3.5 1990 1 1 0 0 0
Jan 7.2 1997 0 0 1 0 1
Tweet 7.6 1987 1 1 0 0 0
Sall 9.5 2005 0 0 0 1 0
I am trying to run some preliminary analysis of this data. The Name is the name of the doll, the review is a rating 1-10, year is year made, and all values after that are binary where they are 1 if they possess a characteristic or 0 if they don't.
I ran
summary(dolls)
and get the header, means, minimums, and maximums of the values.
I am trying to see whether the characteristics correlate with the year or the review rating (for example, whether certain dolls have really high ratings despite unfavorable traits), but I am not sure how to construct charts or which functions to use. I was considering some ANOVA tail testing for outliers and the means of different values, but I am not sure how to compare values like this (in Python I'd write an if-then statement, but I don't know how to do that in R).
This is for a personal study I wanted to conduct and improve my R skills.
Thank you!
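A minimal sketch of one way to start, assuming only the four rows shown in the snippet: `cor()` gives the Pearson correlation between a numeric column and each 0/1 trait, `tapply()` compares mean ratings with and without a trait, and `ifelse()` is the vectorised counterpart of an if-then statement. The names `traits`, `trait_cor`, `strong_means`, and `HighRated`, and the threshold of 7, are assumptions for illustration. With four rows the numbers mean very little; the point is the shape of the calls.

```r
# The four rows from the question, typed in directly instead of read.csv()
dolls <- read.table(header = TRUE, text = "
Name Review Year Strong Skinny Weak Fat Normal
Bell 3.5 1990 1 1 0 0 0
Jan 7.2 1997 0 0 1 0 1
Tweet 7.6 1987 1 1 0 0 0
Sall 9.5 2005 0 0 0 1 0
")
traits <- c("Strong", "Skinny", "Weak", "Fat", "Normal")

# Pearson correlation of the rating and the year with each 0/1 trait
trait_cor <- cor(dolls[c("Review", "Year")], dolls[traits])
print(round(trait_cor, 2))

# Mean rating for dolls without (0) and with (1) the Strong trait
strong_means <- tapply(dolls$Review, dolls$Strong, mean)
print(strong_means)

# Vectorised if-then: flag dolls rated 7 or higher (threshold is an assumption)
dolls$HighRated <- ifelse(dolls$Review >= 7, 1, 0)
```

A call such as `boxplot(Review ~ Strong, data = dolls)` would give a quick chart of the same comparison.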

Static variable next to a dynamic variable in R

I posted yesterday another question but I feel I need to clarify it.
Let's say I have this code
md.NAME <- (subset(MyData, HotelName=="ALAMEDA"))
md.NAME.fc <- (subset(md.ALAMEDA, TIPO=="FORECAST"))
md.NAME.fc.bar <- (subset(md.ALAMEDA.fc, Market.Segment=="BAR"))
What I want is that NAME changes according to a variable set before those 3 lines are run,
So NAME is just dynamic in the sense that before these 3 lines I could say, ok, NAME now is equal to JOHN, but then, I could say that NAME is now equal to PATRIC.
So after running those 3 lines, twice (once for John and once for Patric) somehow in the environment I will get something like this:
6 dataframes, 3 for JOHN and 3 for PATRIC
DATAFRAME 1 WILL BE md.JOHN
DATAFRAME 2 WILL BE md.JOHN.fc
DATAFRAME 3 WILL BE md.JOHN.fc.bar
DATAFRAME 1 WILL BE md.PATRIC
DATAFRAME 2 WILL BE md.PATRIC.fc
DATAFRAME 3 WILL BE md.PATRIC.fc.bar
All the answers I had so far would help me only if "md" and "fc" or "fc.bar" are always the same. But I will have several variables like this, which will change a lot as far as the naming goes. So, it is the center part (NAME) the only one that should change.
I could even have something like:
md.test$NAME <- ...
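One way this could be approached, as a sketch rather than a definitive answer: keep the three subsets for each name in a named list, and only create literal variables such as `md.JOHN.fc.bar` with `assign()` if they are really needed. The toy `MyData` rows, the helper `split_hotel()`, and the vector `hotels` are assumptions for illustration; the column names and the "FORECAST" and "BAR" values come from the code above.

```r
# Toy data with the columns used in the question (rows are made up)
MyData <- data.frame(
  HotelName      = c("JOHN", "JOHN", "PATRIC", "PATRIC"),
  TIPO           = c("FORECAST", "ACTUAL", "FORECAST", "FORECAST"),
  Market.Segment = c("BAR", "BAR", "BAR", "CORP"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the three nested subsets for one hotel name
split_hotel <- function(data, name) {
  all    <- data[data$HotelName == name, ]
  fc     <- all[all$TIPO == "FORECAST", ]
  fc.bar <- fc[fc$Market.Segment == "BAR", ]
  list(all = all, fc = fc, fc.bar = fc.bar)
}

# One list entry per name: md$JOHN$fc.bar, md$PATRIC$fc, ...
hotels <- c("JOHN", "PATRIC")
md <- lapply(setNames(hotels, hotels), function(h) split_hotel(MyData, h))

# If separate variables are required, build their names with paste0()
for (h in hotels) {
  assign(paste0("md.", h), md[[h]]$all)
  assign(paste0("md.", h, ".fc"), md[[h]]$fc)
  assign(paste0("md.", h, ".fc.bar"), md[[h]]$fc.bar)
}
```

The list form tends to be easier to loop over later, since `md[[h]]` works for any name stored in a variable, while variables created with `assign()` have to be fetched again with `get()`.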

How can I make a bar chart where I can sub-divide each of the three x-axis categories?

I'm attempting to make a bar graph, using three different columns from my data set to make up the x-axis (Number_DS, Number_US, Number_A), (and 'Number attracted' as the y-axis) with each of these three variables representing data for each of three different fish species, so basically, three categories on the axis, each subdivided into three sub-categories.
The graph below (which I made by summarising data I had on a previous occasion and producing a concise matrix) shows the kind of graph I'm attempting to produce (without the error bars).
In addition, I'm also planning on calculating standard error or deviation to produce error bars for each. However, I'm struggling to find a way to do so with my data in the format that it is in (different to previous occasion). Does anyone have any code that may help sort the data in a way that generating this graph is possible? I've added some of my data below in hopes that it helps this question make more sense!
Thanks in advance
Species       NumberDS  NumberUS  NumberAcross  Number attracted
Atlantic cod  0         0         92            0
Atlantic cod  0         2         0             0
Haddock       9         0         0             9
Whiting       0         0         4             4
Haddock       0         0         1             0
Whiting       0         1         2             3
I don't know if I got your problem. Assuming df is your data.frame.
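# Split the data by species (Species is the first column)
sps <- split(df, df$Species)
# Column totals per species: rows are variables, columns are species
totals <- sapply(sps, function(sp) apply(sp[, -1], 2, sum))
# Grouped bars: one group per variable, one bar per species
bp <- barplot(as.matrix(t(totals)), beside = TRUE, col = 1:ncol(totals), xaxt = "n",
              legend.text = TRUE,
              args.legend = list(x = "topright", bty = "n", cex = .8, ncol = 1))
# Label each group under its second bar (the middle one with three species)
axis(1, at = bp[2, ], labels = row.names(totals), las = 2, cex.axis = .5, tick = FALSE)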
Is that what you want?
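For the error bars mentioned in the question, a rough base R sketch could compute the mean and standard error per species and draw the bars with `arrows()`. The data frame below retypes the six sample rows and leaves out the "Number attracted" column; the names `vars`, `se`, `means`, and `ses` are assumptions for illustration, and here the species form the groups on the x-axis.

```r
# The six sample rows from the question (Number attracted left out)
df <- data.frame(
  Species      = c("Atlantic cod", "Atlantic cod", "Haddock",
                   "Whiting", "Haddock", "Whiting"),
  NumberDS     = c(0, 0, 9, 0, 0, 0),
  NumberUS     = c(0, 2, 0, 0, 0, 1),
  NumberAcross = c(92, 0, 0, 4, 1, 2)
)
vars <- c("NumberDS", "NumberUS", "NumberAcross")

# Standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# Rows are variables, columns are species
by_species <- split(df[vars], df$Species)
means <- sapply(by_species, colMeans)
ses   <- sapply(by_species, function(d) sapply(d, se))

# Grouped bars: one group per species, one bar per variable
bp <- barplot(means, beside = TRUE, legend.text = TRUE,
              ylim = c(0, max(means + ses) * 1.1))

# Error bars of +/- one standard error; zero-length bars only raise warnings
suppressWarnings(
  arrows(bp, means - ses, bp, means + ses, angle = 90, code = 3, length = 0.03)
)
```

Swapping `se` for `sd` in the `ses` line would give standard deviation bars instead.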

Lines between certain points in a plot, based on the data? (with R)

I have done my research and googling but have yet to find a solution to the following problem. I have quite often found solutions to R-related issues on this forum, so I thought I'd give it a try and hope that somebody can suggest something. I need it for my PhD thesis; anybody whose code or suggestions I use will naturally be acknowledged and credited.
So: I need to draw lines/segments to connect points in a plot (of multidimensional scaling, specifically) in R (SPSS-based solutions are welcome as well), but not between all points, just between those that represent properties/variables that at least one data item shares. The placement of the lines should be based on the same data that the plot itself is based on. Let me exemplify; below are some fictional data with dummy variables, where '1' means that the item has the property:
"properties"
a b c
"items" ---------
tree | 1 1 0
house | 0 1 1
hut | 0 1 1
book | 1 0 0
The plot is a multidimensional scaling plot (distances are to be interpreted as dissimilarities). This is the logic:
There's a line between A and B, because there is at least one item/variable ("tree") in the data that has both properties.
There is a line between B and C, because there is at least one item in the data ("house" and "hut") that has both properties.
There is an item ("book") that has only one property (A), so it does not affect the placement of the lines.
Importantly, there is no line between A and C, because there are no items in the data that have both properties.
What I am looking for is a way to add the grey lines automatically/computationally that I have for now drawn manually on the plot above. The automatic drawing should be based on the data as described above. With a small data set, drawing the lines manually is no problem, but becomes a problem when there are tens of such "properties" and hundreds of items/rows of data.
Any ideas? Some R code (commented if possible) would be most welcome!
EDIT: It seems I forgot something very important. First, the solution proposed by @GaborCsardi below works perfectly with the example data, thanks for that! But I forgot to mention that the linking of the points should also be "conservative", with as few connecting lines as possible. For example, if there is an item that has all the "properties", it should not create lines between every single pair of property points in the plot just because of that, if the points are already connected by other items, even if indirectly. So a plot based on the following data should not be a full triangle, even though item1 has all three properties:
A B C
item1 1 1 1
item2 1 1 0
item3 0 1 1
Instead, A,B and B,C should be connected by a line, but a line between A and C would be excessive, as they are already indirectly connected (through B). Could this be done with incidence graphs?
This is very easy if you use graphs, and create the projection of the bipartite graph that you have in your table. E.g.
library(igraph)
## Some example data
mat <- " properties
items a b c
tree 1 1 0
house 0 1 1
hut 0 1 1
book 1 0 0
"
tab <- read.table(textConnection(mat), skip=1,
header=TRUE, row.names=1)
## Create a bipartite graph
graph <- graph.incidence(as.matrix(tab))
## Project the bipartite graph
proj <- bipartite.projection(graph)
## Plot one of the projections, the one you need
## happens to be the second one
plot(proj$proj2)
## Minimum spanning tree of the projection
plot(minimum.spanning.tree(proj$proj2))
For more information see the manual pages, i.e. ?"igraph-package", ?graph.incidence, ?bipartite.projection and ?plot.igraph.
