code to sort csv. Using full date (yyyy-mm-dd) to label with different visits (PACE01, PACE02 etc) - r

I have a csv. I want to load it into R to get the desired outcome (highlighted). For example study id 256 was taken on differing dates (ie. different vists) I want PACE01, PACE02, PACE03 or PACE04 to be added for each recurring visit respectively.
Some years have two visits so I cant just use year. I need to take into account whole date.
Hope this makes sense, I would really appreciate your help. I have 14,000+ samples to sort.
Desired Outcome

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst trying to do the code I have restricted myself to just 3 years. The illustration of the structure is based on this test). The number of years captured will change overtime- generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

Extracting from data frame at specific time intervals ....index or posixct?

I currently have a dataframe which includes a running timeline of POSIXct (see below).
Basically given a starting time of my choosing I want to be able to take rows at specific time interval. E.g. Say I start taking rows at four pm I then want to take 9 minutes, then not take anything for the next two, then nine again. I'm guessing the best approach is possibly using indexing but I also thought something like the lubridae package could be used but not sure how to exactly do it.
Thanks!

Can I make a time series work with date objects rather than integers?

I have time series data that I'm trying to analyse in R. It was provided as a CSV from excel, which I subsequently read as a data.frame all. Let's say it has two columns: all$date and all$people, representing the count of people on a particular date. The frequency is hence daily.
Being from Excel, the dates are integers representing the number of days since 1900-01-01.
I could read the data as people = ts(all$people, start=c(all$date[1], 1), frequency=365); but that gives a silly start value of almost 40000 because the data starts in 2006. The start parameter doesn't take a date object, according to ?ts, so I can't just use as.Date():
ts - ...
start: the time of the first observation. Either a single number
or a vector of two integers, which specify a natural time unit and
a (1-based) number of samples into the time unit. See the examples
for the use of the second form.
I could of course set start=1, but it's a bit painful to figure out what season we're in when the plot tells me interesting things are happening around day 2100. (To be clear, setting frequency=365 does tell me what year we're in, but isn't useful more precise dates). Is there a useful way of expressing the date in ts in a human-readable form so that I don't have to keep calling as.Date() to understand when the interesting features are happening?

Creating New Variables in R that relate to

I have 7 different variable in an excel spreadsheet that I have imported into R. They each are columns with a size of 3331. They are:
'Tribe' - there are 8 of them
'Month' - when the sampling was carried out
'Year' - the year when the sampling was carried out
'ID" - an identifier for each snail
'Weight' - weight of a snail in grams
'Length' - length of a snail shell in millimetres
'Width' - width of a snail shell in millimetres
This is a case where 8 different tribes have been asked to record data on a suspected endangered species of snail to see if they are getting rarer, or changing in size or weight.
This happened at different frequencies between 1993 and 1998.
I would like to know how to be able to create a new variables to the data so that if I entered names(Snails) # then it would list the 7 given variables plus any added variable that I have.
The dataset is limited to the point where I would like to add new variables. Such as, knowing the counts per month of snails in any given month.
This would rely on me using - Tribe,Month,Year and ID. Where if an ID (snail identifier) was were listed according to the rates in any given month then I would be able to sum them to see if there are any changes in counts. I have tried:
count=c(Tribe,Year,Month,ID)
count
But, after doing things like that, R just has a large list of that is 4X the size of the dataset. I would like to be able to create a given new variable that is of column size n=3331.
Or maybe I would like to create a simpler variable so I can see if a tribe collected at any given month. I don't know how I can do this.
I have looked at other forums and searched but, there is nothing that I can see that helps me in my case. I appreciate any help. Thanks
I'm guessing you need to organise your variables in a single structure, such as a data.frame.
See ?data.frame for the help file.
To get you started, you could do something like:
snails <- data.frame(Tribe,Year,Month,ID)
snails
# or for just the first few rows
head(snails)
Then this would have your data looking similar to your Excel file like:
Tribe Year Month ID
1 1 1 1 a
2 2 2 2 b
3 3 3 3 c
<<etc>>
Then if you do names(snails) it will list out your column names.
You could possibly avoid some of this mucking about by just importing your Excel file either directly from Excel, or saving as a csv (comma separated values) file first and then using read.csv("name_of_your_file.csv")
See http://www.statmethods.net/input/importingdata.html for some more specifics on this.
To tabulate your data, you can do things like...
table(snails$Tribe)
...to see the number of snail records collected by each tribe. Or...
table(snails$Tribe,snails$Year)
...to see the trends in each tribe by each year. The $ character will let you access the named variable (column) inside a data.frame in the same way you are currently using the free floating variables. This might seem like more work initially, but it will pay off greatly when you need to do some more involved analysis.
Take for example if you want to only analyse the weights from tribe "1", you could do:
snails$Weight[snails$Tribe==1]
# mean of these weights
mean(snails$Weight[snails$Tribe==1])
There are a lot more things I could explain but you would probably be better served by reading an excellent website like Quick-R here: http://www.statmethods.net/management/index.html to get you doing some more advanced analysis and plotting.

Specific date format conversion problems in R

Basically I want to know why as.Date(200322,format="%Y%W") gives me NA. While we are at it, I would appreciate any advice on a data structure for repeated cross-section (aka pseudo-panel) in R.
I did get aggregate() to (sort of) work, but it is not flexible enough - it misses data on columns when I omit the missed values, for example.
Specifically, I have a survey that is repeated weekly for a couple of years with a bunch of similar questions answers to which I would like to combine, average, condition and plot in both dimensions. Getting the date conversion right should presumably help me towards my goal with zoo package or something similar.
Any input is appreciated.
Update: thanks for string suggestion, but as you can see in your own example, %W part doesn't work - it only identifies the year while setting the current day while I need to set a specific week (and leave the day blank).
Use a string as first argument in as.Date() and select a specific weekday (format %w, value 0-6). There are seven possible dates in each week, therefore strptime needs more information to select a unique date. Otherwise the current day and month are returned.
> as.Date(paste("200947", "0", sep="-"), format="%Y%W-%w")
[1] "2009-11-22"

Resources