I am a beginner using R, and I am wanting to create a dataframe that stores a range of dates to their respective classified time period.
paleo.periods <- c("Paleoindian","Early Paleoindian", "Middle Paleoindian", "Late Paleoindian", "Archaic","Early Archaic", "Middle Archaic","Late Archaic","Woodland","Early Woodland","Middle Woodland","Late Woodland","Late Prehistoric")
paleo.dates <- c(c(13500,8000), c(13500,10050) ,c(10050,9015), c(9015,8000), c(8000,2500), c(8000,5500), c(5500,3500), c(3500,2500), c(2500,1150), c(2500,2000), c(2000,1500), c(1500,1150), c(1150,500))
I would like for the arrangement to come out where I can refer to a given time period, ex: "Late Woodland", and get the associated vector of it's beginning and end timeframes, ex: (1500,1150)
I tried simply doing this by
paleo.seg <- data.frame(paleo.periods,paleo.dates)
however, this creates 3 variables: a list of the periods, a list of the vectors, and paleo.dates. I am not sure why it is creating 3 variables, as I'd like it to be only 2: paleo.periods and paleo.dates. I would also like to refer to them as paleo.seg$paleo.periods which will return the list of periods (and later use this to somehow refer to the periods individually), same with the dates.
Essentially I would like my dataframe to look a bit like this:
paleoperiods paleodates
"Late Woodland" 1500,1100
Therefore I could look specifically for the string "Late Woodland" and find the vector dates. I tried doing this on my current data.frame, and
"Woodland" %in% paleo.seg returns false. So I feel like I am misunderstanding how to build a proper dataframe, as well as being able to match one categorical variable to two dates.
There are a few ways that you could go about this depending on your reasoning about what you want to do with your dataframe. My recommendation would actually be to split the dates column into two separate date columns(start and end I believe, from your description). This way you could calculate or use rules based on the dates. I've found this useful when looking at data, as it gives you the ability to filter based on two different aspects of the date. If you would like them to be in the same column, you could make the dates a character in order to have them in the same column. However, this approach does have drawbacks in terms of using it for exploratory data analysis. An example of this would be:
paleo.dates <- c("13500,8000","13500,10050","10050,9015","9015,8000", ...)
This would allow you to look up Late Woodland and get "1500,1100", but you wouldn't be able to search for periods occurring after 1500 if that type of analysis is something you would be doing at a later point.
Related
I'm just 2 days into R so I hope I can give enough Info on my problem.
I have an Excel Table on Endothelial Cell Angiogenesis with Technical Repeats on 4 different dates. (But those Dates are not in order and in different weeks)
My Data looks like this (of course its not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare i.e the "Nb Nodes" from day 1 to day 4.
So to finally have a jitterplot containing the group, the investigated Data Point and the date.
I'm a medical student so I dont really have yet any knowledge about this kind of stuff but Im trying to learn it. Hopefully I provided enough Info!
Found the solution:
#Group by
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
#Summarizing the mean in every Group and Date
summarise(DateGroup, mymean = mean(Date$`Nb meshes`))
I think the below code will work.
group_by the dimension you want to summarize by
2a. across() is helper verb so that you don't need to manually type each column specifically, it allows us to use tidy select language so that we can quickly reference columns that contains "Nb" (a pattern that I noticed from your screenshot)
2b. With across(), second argument, you then use formula that you want to apply to each column from the first argument of across()
2c. Optional argument in across so that the new columns names have a name convention)
Good luck on your R learning! It's a really great language and you made the right choice.
#df is your data frame
df %>% group_by(Exp.Date) %>%
summarize(across(contains("Nb"),mean,.names = {.fn}_{.col}))
#if you just want a single column then do this
df %>% group_by(Exp.Date) %>%
summarize(mean_nb_nodes=mean(`Nb nodes`))
I have a data frame consisting of three variables named momentum returns(numeric),volatility (factor) and market states (factor). Volatility and market states both have two -two levels. Volatility have levels named high and low. Market states have level named positive and negative I want to make a two sorted table. I want mean of momentum returns in every case.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar_state<-r_sample_factor(30,x=c("positive","negtive"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by #r2evans if you want mean of every sorted cases you can apply following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with help of data.table package. This approach will do same task in less time.
library(data.table)
df<-as.data.table(df)
## if you want results in data frame format
df[,.(mean(mom)),by=.(vol,mar)]
## if you want in simple vector form
df[,mean(mom),by=vol,mar]
Currently have a list of 27 correlation matrices with 7 variables, doing social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value (i.e. other than "NA", since there are 7 variables, I am keeping anything that DOES NOT contain 6"NA"s, and correlation with itself, 1 -> this is the tricky part because 1 is a value, but it's meaningless to me in a correlation matrix).
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. But I have been trying for hours but to no avail, as this is my first real coding experience.
Thanks a lot.
since you didn't provide sample data, I am first going to convert your matrix into a dataframe and then I am just going to pretend that you want us to see if your dataframe df has a variable var with at least one non-NA or 1. value
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe
table(df$var) will show you the distribution of values in your dataframe's variable. from here you can make your judgement call on whether to keep the variable or not.
I have some classified raster layers as categorical land cover maps. All the layers having exactly the same categories (lets say: "water", "Trees", "Urban","bare soil") but they are from different time points (e.g. 2005 and 2015)
I load them into memory using the raster function like this:
comp <- raster("C:/workingDirectory4R/rasterproject/2005marsh3.rst")
ref <- raster("C:/workingDirectory4R/rasterproject/2013marsh3.rst")
"comp" is the comparison map at time t+1 and "ref" is the reference map from time t. Then I used the crosstab function to generate the confusion table. This table can be used to explore the changes in categories through the time interval.
contingency.Matrix <- crosstab(comp, ref)
The result is in the matrix format with the "comp" categories in the column and "ref" in the rows. And column and row names labeled with numbers numbers 1 to 4.
Now I have 2 questions and I really appreciate any help on how to solve them.
1- I want to assign the category names to the columns and rows of
the matrix to facilitate it's interpretation.
2- Now let's say I have three raster layers for 2005, 2010 and 2015.
This means I would have two confusion tables one for 2005-2010 and
another one for 2010-2015. What's the best procedure to automate
this process with the minimal interaction from user.
I thought to ask the user to load the raster layers, then the code save them in a list. Then I ask for a vector of years from the user but the problem is how can I make sure that the order of raster layers and the years are the same? And is there a more elegant way to do this.
Thanks
I found a partial answer to my first question. If the categorical map is created in TerrSet(IDRISI) software with the ".rst" extention then I can extract the category names like this:
comp <- raster("C:/rasterproject/2005subset.rst")
attributes <- data.frame(comp#data#attributes)
categories <- as.character(attributes[,8])
and I get a vector with the name of categories. However if the raster layers are created with a different extension then the code won't work. For instance if the raster is created in ENVI then the third line of the code should get changed to:
categories <- as.character(attributes[,2])
I am a new R user and an unexperienced coder and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assigment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)