I'm a bit new to R, so apologies up front if this isn't explained as clearly as it should be. I have 6 Excel sheets within a single workbook (Trees_2020, Trees_2017, Trees_2014, Trees_2011, Trees_2008, Trees_2003). These contain plot IDs (ID_Plot), within-plot tree ID numbers (ID_tree) and growth data (DBH_mm). The problem is that the tree IDs do not remain the same through the years but are linked through their old ID (the Field_Mapping software recognises trees by location but assigns a new number, which is linked to the Old_ID).
What I'm trying to do is merge all the sheets linking the years together based on the plot ID and then the Old_ID to current ID.
2020 Data Example
2017 Data Example
You can see in the 2020 sheet a column linking to the Old_ID number of 2017 and this is true of all sheets. Trees that are recorded for the first time do not have an Old_ID number in that first recording.
The ideal output would be a single sheet where a unique identifier is added for each tree and the DBH of each tree for each year is linked together based on the plot ID and the within-plot ID_tree (coupled via Old_ID).
Ideal Output
Apologies if that's very confusing, but I struggled to explain it in a simpler way. I've been playing with tidyverse and loops but can't seem to figure it out, so any help is greatly appreciated!
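One way to approach the linking step is sketched below for just two of the years, using base R's merge(). The toy data frames and their values are invented for illustration, and the names prev, curr, linked and tree_uid are hypothetical; only ID_Plot, ID_tree, Old_ID and DBH_mm come from the question. It assumes Old_ID in the 2020 sheet refers to ID_tree in the 2017 sheet within the same plot.

```r
# Toy stand-ins for two of the sheets (all values are invented)
trees_2017 <- data.frame(
  ID_Plot = c(1, 1, 2),
  ID_tree = c(1, 2, 1),
  Old_ID  = c(NA, NA, NA),
  DBH_mm  = c(100, 150, 200)
)
trees_2020 <- data.frame(
  ID_Plot = c(1, 1, 2, 2),
  ID_tree = c(5, 6, 3, 4),
  Old_ID  = c(1, 2, 1, NA),   # NA = tree recorded for the first time
  DBH_mm  = c(110, 160, 215, 50)
)

# Rename columns so each year stays distinguishable after the join.
# The 2017 ID_tree is renamed to Old_ID because that is what 2020 points at.
prev <- data.frame(
  ID_Plot  = trees_2017$ID_Plot,
  Old_ID   = trees_2017$ID_tree,
  DBH_2017 = trees_2017$DBH_mm
)
curr <- data.frame(
  ID_Plot      = trees_2020$ID_Plot,
  ID_tree_2020 = trees_2020$ID_tree,
  Old_ID       = trees_2020$Old_ID,
  DBH_2020     = trees_2020$DBH_mm
)

# all = TRUE keeps new trees (no Old_ID) and trees that disappeared by 2020
linked <- merge(curr, prev, by = c("ID_Plot", "Old_ID"), all = TRUE)

# A simple unique identifier per linked tree
linked$tree_uid <- seq_len(nrow(linked))
linked
```

Going further back would presumably mean repeating the same join against 2014, 2011 and so on, each time matching the previous sheet's Old_ID to the older sheet's ID_tree. That part is not shown here.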
Related
I have recently received an output from an online survey (ESRI Survey123) that stores each recorded attribute as a new column of the table. The survey reports characteristics of single trees located on a study site: e.g. beech1, beech2, etc. For each beech, several attributes are recorded, such as height, shape, etc.
This is what the output table looks like in Excel. ID simply represents the site number:
Now I wonder, how can I read those data into R to make sure that columns 1:3 belong to beech1, columns 4:6 represent beech2, etc.? I am looking for something that would paste the beech1 into names of the following columns: beech1.height, beech1.shape. But I am not sure how to do it?
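A minimal sketch of the renaming idea follows. Since the screenshot is not available, the layout is an assumption: an ID column followed by blocks of three columns, where the first column of each block carries the tree label (beech1, beech2) and the next two are repeated attribute names. The toy values and the names attr_cols, old and tree are made up for illustration.

```r
# Toy table with repeated attribute names; check.names = FALSE keeps them as-is
survey <- data.frame(
  ID = c(1, 2),
  beech1 = c("yes", "yes"), height = c(10, 12), shape = c("a", "b"),
  beech2 = c("yes", "no"),  height = c(8, NA),  shape = c("c", NA),
  check.names = FALSE
)

block_size <- 3                                   # assumed columns per tree
attr_cols  <- setdiff(seq_along(survey), 1)       # everything except ID
old        <- names(survey)[attr_cols]

# Take the first name of each block as the tree label and repeat it over the block
tree <- rep(old[seq(1, length(old), by = block_size)], each = block_size)

# Leave the label column alone, prefix the attribute columns with the label
names(survey)[attr_cols] <- ifelse(old == tree, tree, paste(tree, old, sep = "."))
names(survey)
```

If the file is read with read.csv(), passing check.names = FALSE should stop R from turning the repeated names into height.1, shape.1 and so on before the renaming runs.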
I am very new to R programming, and I have a few datasets that I'm playing around with. One of the things I'm trying to do is use ggplot to graph what percentage of the population in each state voted in the 2016 election.
The first csv file I have contains an estimate of the population of each state in 2016, and the second csv file I have contains the number of votes cast by each party in the 2016 election. I'm not sure how to attach the file here, so I will show some screenshots:
2016 Election Votes:
2016 State Populations:
From what I understand, I can read the 2016 election votes csv file, and create a new column that contains the total votes using something like:
electionVotes$TotalVotes <- electionVotes$DemocraticCandidates + electionVotes$RepublicanCandidates + electionVotes$OtherCandidates
Once I have that, I would like to create a column where I do something like:
electionVotes$PercentVoted <- electionVotes$TotalVotes / *number of people per state*
I understand how to use ggplot to display the results, but what is confusing to me is how I can accurately use these tables with each other when one State column uses an abbreviation for the state name, like "AL", while the other one uses ".Alabama".
Any thoughts on what would be the best process to do this other than manually editing the csv file? Thank you!
You could bring in a table (http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html) that links the abbreviations to the full state names and use that to join the two datasets.
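As a rough sketch of that join: R ships with the built-in vectors state.abb and state.name (50 states, no District of Columbia), which can serve as the lookup table without downloading anything. The toy numbers below are invented, and the column names State, TotalVotes and Population are assumptions about the two csv files.

```r
# Toy versions of the two files (all numbers invented)
election_votes <- data.frame(
  State      = c("AL", "AK"),
  TotalVotes = c(2000000, 300000)
)
state_pop <- data.frame(
  State      = c(".Alabama", ".Alaska"),
  Population = c(4800000, 740000)
)

# Lookup table built from R's built-in state vectors
lookup <- data.frame(State = state.abb, StateName = state.name)

# Strip the leading "." so ".Alabama" becomes "Alabama"
state_pop$StateName <- sub("^\\.", "", state_pop$State)

# Abbreviation -> full name, then full name -> population
votes_named <- merge(election_votes, lookup, by = "State")
combined <- merge(
  votes_named,
  state_pop[, c("StateName", "Population")],
  by = "StateName"
)
combined$PercentVoted <- 100 * combined$TotalVotes / combined$Population
combined
```

Rows for anything not in state.abb (such as DC) would drop out of the first merge, so it is worth comparing nrow() before and after.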
I need to sort my data by date. Previously I had one dataset and used select and filter to create two separate datasets, one with data from June 30 or earlier and the other with data from July 1 or later. However, my problem is that I seem to have lost some rows during this process - I went from 1390 rows in my original dataset to 1335 rows between the two new datasets. I can't figure out what happened.
What I am trying to do now is use my original dataset, ethica_surveys and create a new column. I want to call this column pre_post. I know how to create a new column, but I want to filter the data into this column based on my date parameters. So the rows containing pre should be dated June 30 or earlier, and those containing post should be dated July 1 or later. I am filtering based on the variable response_time, but I am just unsure of how to code this in RStudio.
Thanks in advance for any help you can provide.
This seemed to work after a lot of trial and error.
library(dplyr)  # provides if_else()
# "post" means July 1 or later, so use >= rather than >
ethica_surveys$pre_post <-
  if_else(ethica_surveys$response_time >= as.Date("2018-07-01"), "post", "pre")
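As for the rows that went missing in the earlier split, one possible explanation (a guess, not confirmed by anything in the question) is missing values in response_time: a row with an NA date satisfies neither condition, so it ends up in neither subset. A small base R sketch with invented dates, where ethica_toy and cutoff are hypothetical names:

```r
# Toy data with one missing date (values invented)
ethica_toy <- data.frame(
  response_time = as.Date(c("2018-06-15", "2018-07-01", NA, "2018-08-20"))
)
cutoff <- as.Date("2018-07-01")

# Labelling keeps every row; the NA date simply gets an NA label
ethica_toy$pre_post <- ifelse(ethica_toy$response_time >= cutoff, "post", "pre")
table(ethica_toy$pre_post, useNA = "ifany")

# Splitting drops rows whose condition evaluates to NA
pre_rows  <- subset(ethica_toy, response_time < cutoff)
post_rows <- subset(ethica_toy, response_time >= cutoff)

n_missing <- sum(is.na(ethica_toy$response_time))
n_split   <- nrow(pre_rows) + nrow(post_rows)
c(total = nrow(ethica_toy), split = n_split, missing = n_missing)
```

Running sum(is.na(ethica_surveys$response_time)) on the real data would show whether the count of missing dates matches the gap.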
I'm somewhat new to R, so I apologize in advance if the answer to this question is obvious. I have a very long data frame (only one variable) from which I want to create multiple objects, each a subset of the data frame. The code below scrapes the data, formats it as data frame 'aa', and names the variable 'whatever':
aa<-data.frame(readLines("ftp://ftp.cmegroup.com/pub/settle/stlint"))
aa<-data.frame(aa[-1:-3,])
colnames(aa)<-"whatever"
I am looking to subset each section under a heading beginning with 'ZE' and ending with the last data row before the next 'ZE' or before the 'TOTAL'. So basically I want 36 objects (length(grep("ZE",aa$whatever[1:nrow(aa)]))=36), each starting with its respective 'ZE' title followed by (roughly) 70 rows of data, with each object identified by its title. For instance, I would want the first dataset (headed by row ZE MAR15 EURODOLLAR OPTIONS CALL) to be named some variant of 'March 2015 Calls', as I just need to denote the month, year, and whether the data is for calls or puts.
I can actually code this up in batch through a loop, but here's my problem: right now the first 'ZE' month is Mar15, i.e. March 2015, and the last 'ZE' month is Dec18, i.e. December 2018. This will change as time goes on, though, and I'm hoping to name them automatically based on the first line, without tweaking the script when the months change for each contract. So is it possible to flexibly name each of these subsets based on the content of the header?
Thanks
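A sketch of one way to do this is to keep the sections in a single named list rather than 36 separate objects, with the names derived from each header line. The toy lines below are invented apart from the first header, which is taken from the question. It assumes each header starts with "ZE ", that the second word is the month code, and that the last word is CALL or PUT.

```r
# Toy stand-in for aa$whatever (only the first header comes from the question)
lines <- c(
  "ZE MAR15 EURODOLLAR OPTIONS CALL",
  "row a", "row b",
  "ZE MAR15 EURODOLLAR OPTIONS PUT",
  "row c",
  "TOTAL 123",
  "ZE JUN15 EURODOLLAR OPTIONS CALL",
  "row d"
)

is_header  <- grepl("^ZE ", lines)
is_total   <- grepl("^TOTAL", lines)
section_id <- cumsum(is_header)          # increases by 1 at every header

# Within each section, drop the TOTAL line and anything after it
after_total <- ave(as.integer(is_total), section_id, FUN = cumsum) > 0
keep <- section_id > 0 & !after_total

sections <- split(lines[keep], section_id[keep])

# Build names such as "MAR15_CALL" from the header (first line) of each section
headers    <- vapply(sections, `[`, character(1), 1)
parts      <- strsplit(headers, " +")
month_code <- vapply(parts, `[`, character(1), 2)
kind       <- vapply(parts, function(p) p[length(p)], character(1))
names(sections) <- unname(paste(month_code, kind, sep = "_"))

names(sections)
sections[["MAR15_PUT"]]
```

Because the names come from the data itself, nothing needs editing when the contract months roll forward. With the real data, lines would be as.character(aa$whatever).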
I have 7 different variables in an Excel spreadsheet that I have imported into R. Each is a column of length 3331. They are:
'Tribe' - there are 8 of them
'Month' - when the sampling was carried out
'Year' - the year when the sampling was carried out
'ID' - an identifier for each snail
'Weight' - weight of a snail in grams
'Length' - length of a snail shell in millimetres
'Width' - width of a snail shell in millimetres
This is a case where 8 different tribes have been asked to record data on a suspected endangered species of snail to see if they are getting rarer, or changing in size or weight.
This happened at different frequencies between 1993 and 1998.
I would like to know how to add new variables to the data, so that if I entered names(Snails) it would list the 7 given variables plus any variables I have added.
The dataset is limited in its current form, which is why I would like to add new variables, such as the count of snails in any given month.
This would rely on me using Tribe, Month, Year and ID. If each ID (snail identifier) were listed against the month it was recorded in, I would be able to sum them to see if there are any changes in counts. I have tried:
count=c(Tribe,Year,Month,ID)
count
But after doing things like that, R just gives a large vector that is 4 times the size of the dataset. I would like to be able to create a new variable with a column length of n = 3331.
Or maybe I would like to create a simpler variable so I can see whether a tribe collected in any given month. I don't know how I can do this.
I have looked at other forums and searched, but there is nothing that I can see that helps in my case. I appreciate any help. Thanks
I'm guessing you need to organise your variables in a single structure, such as a data.frame.
See ?data.frame for the help file.
To get you started, you could do something like:
snails <- data.frame(Tribe,Year,Month,ID)
snails
# or for just the first few rows
head(snails)
Then this would have your data looking similar to your Excel file like:
  Tribe Year Month ID
1     1    1     1  a
2     2    2     2  b
3     3    3     3  c
<<etc>>
Then if you do names(snails) it will list out your column names.
You could possibly avoid some of this mucking about by just importing your Excel file either directly from Excel, or saving as a csv (comma separated values) file first and then using read.csv("name_of_your_file.csv")
See http://www.statmethods.net/input/importingdata.html for some more specifics on this.
To tabulate your data, you can do things like...
table(snails$Tribe)
...to see the number of snail records collected by each tribe. Or...
table(snails$Tribe,snails$Year)
...to see the trends in each tribe by each year. The $ character will let you access the named variable (column) inside a data.frame in the same way you are currently using the free floating variables. This might seem like more work initially, but it will pay off greatly when you need to do some more involved analysis.
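For the counts per month asked about in the question, here is a minimal sketch on invented data. The names snails_toy, monthly_counts, n_snails and n_in_month are hypothetical. It shows two variants: a summary table with one row per Tribe/Year/Month, and a column the same length as the data, which is what the question asks for.

```r
# Toy data (values invented); the real data frame would have 3331 rows
snails_toy <- data.frame(
  Tribe = c(1, 1, 1, 2, 2),
  Year  = c(1993, 1993, 1993, 1993, 1994),
  Month = c(5, 5, 6, 5, 1),
  ID    = c("a", "b", "c", "d", "e")
)

# Variant 1: one row per Tribe/Year/Month with the number of snail records
monthly_counts <- aggregate(ID ~ Tribe + Year + Month, data = snails_toy, FUN = length)
names(monthly_counts)[names(monthly_counts) == "ID"] <- "n_snails"
monthly_counts

# Variant 2: the same count repeated on every row, so the column has length nrow()
snails_toy$n_in_month <- ave(
  seq_along(snails_toy$ID),
  snails_toy$Tribe, snails_toy$Year, snails_toy$Month,
  FUN = length
)
snails_toy
names(snails_toy)   # now includes the added n_in_month column
```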
Take, for example, analysing only the weights from tribe "1" (this assumes Weight was also included when building the snails data frame). You could do:
snails$Weight[snails$Tribe==1]
# mean of these weights
mean(snails$Weight[snails$Tribe==1])
There are a lot more things I could explain but you would probably be better served by reading an excellent website like Quick-R here: http://www.statmethods.net/management/index.html to get you doing some more advanced analysis and plotting.