I have a very simple question and I hope you can help me.
I have a dataset with monthly temperatures from 1958 to 2020. This gives me a total of 756 observations, which matches the number of months.
This is the only column I have, and I would like to add a column with the date in month-year format, starting from 01-1958 in the first observation, followed by 02-1958, 03-1958, ..., 12-2020.
Any ideas?
Thank you very much!
Two things:
I think a Date object would be much better (there is no Month object), since it has natural number-like properties that allow you to find differences, plot without bias, etc. Note that when the data is stored this way, every other representation can be derived trivially for reports/renders.
Even if you must go with a string, I suggest putting year first so that sorting works as expected.
You offered no data, so I'll make something up:
mydata <- data.frame(val = 1:756)
mydata$date <- seq(as.Date("1958-01-01"), length.out=756, by="month")
mydata$ym_chr <- format(mydata$date, format = "%Y-%m")
mydata$my_chr <- format(mydata$date, format = "%m-%Y")
mydata[c(1:5, 752:756),]
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 752 752 2020-08-01 2020-08 08-2020
# 753 753 2020-09-01 2020-09 09-2020
# 754 754 2020-10-01 2020-10 10-2020
# 755 755 2020-11-01 2020-11 11-2020
# 756 756 2020-12-01 2020-12 12-2020
As a quick demonstration that we have exactly one observation (no more, no fewer) per month for every year, here's a quick table:
table(year=gsub(".*-", "", mydata$my_chr), month=gsub("-.*", "", mydata$my_chr))
# month
# year 01 02 03 04 05 06 07 08 09 10 11 12
# 1958 1 1 1 1 1 1 1 1 1 1 1 1
# 1959 1 1 1 1 1 1 1 1 1 1 1 1
# 1960 1 1 1 1 1 1 1 1 1 1 1 1
# ...
# 2018 1 1 1 1 1 1 1 1 1 1 1 1
# 2019 1 1 1 1 1 1 1 1 1 1 1 1
# 2020 1 1 1 1 1 1 1 1 1 1 1 1
All snipped rows are identical in all but the year, i.e., all 1s. The sum(.) of this is 756. (Just checking since I wanted to make sure I was doing it right.)
Lastly, to highlight my comment about sorting, here are some examples premised on the knowledge that val is incrementing from 1.
head(mydata[order(mydata$ym_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 6 6 1958-06-01 1958-06 06-1958
head(mydata[order(mydata$my_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 13 13 1959-01-01 1959-01 01-1959
# 25 25 1960-01-01 1960-01 01-1960
# 37 37 1961-01-01 1961-01 01-1961
# 49 49 1962-01-01 1962-01 01-1962
# 61 61 1963-01-01 1963-01 01-1963
If being able to sort by date is important, then I suggest it will be much simpler to use either $date or the string $ym_chr.
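To illustrate the point about Date columns, here is a small sketch (not part of the code above; the names last_date, yr, mo and gap_days are made up for this example). It rebuilds the same monthly sequence, checks where it ends, and derives a few other representations from it:

```r
# Rebuild the monthly sequence used above.
dates <- seq(as.Date("1958-01-01"), length.out = 756, by = "month")

# If 756 months is right, the sequence should end in December 2020.
last_date <- dates[length(dates)]

# Numeric year and month can be derived whenever they are needed.
yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# Differences between Dates are real time spans (here, days per step).
gap_days <- as.numeric(diff(dates))
range(gap_days)
```

None of this is possible directly on the "01-1958" strings without parsing them back into dates first.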
I have two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share "Species_name" and "Biom" values (a Biom is an area of the world with the same climatic conditions). "Biom" values range from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of a species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1        1        13         10   Abrocomo_ben 0.076923
1        1        8          1    Abrocomo_cin 0.125000
1        1        30         10   Abrocomo_cin 0.033333
1        2        10         1    Abrothrix_an 0.200000
1        1        44         10   Abrothrix_an 0.022727
1        3        6          2    Abrothrix_je 0.500000
1        1        7          12   Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1      1        8          1    Abrocomo_cin 0.125000
1      5        30         10   Abrocomo_cin 0.166666
1      1        5          2    Abrocomo_cin 0.200000
1      1        10         1    Abrothrix_an 0.100000
1      1        44         10   Abrothrix_an 0.022727
1      6        18         1    Abrothrix_je 0.333333
1      1        23         4    Abrothrix_lo 0.130434
I want to highlight rows that have the same combination of "Species_name" and "Biom": in the example they are lines 3, 4, 5 from CDD26_FF matching lines 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but also all the other columns.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the highlighted lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-level nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames and satisfy the matching conditions (in this case: matching "Biom" and "Species_name").
Subsequently it's easy to calculate some ratio using mutate:
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: Remove the select() step if you need all columns, or adjust it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
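Since the question asks to keep all the other columns too, here is a minimal sketch of the same inner join in base R with merge(), assuming you would rather not depend on dplyr. The data frames ff and hi are cut-down toy copies of the tables in the question, and the "_FF"/"_HI" suffixes are just a naming choice for columns that exist in both tables:

```r
# Toy copies of the two tables (values taken from the question).
ff <- data.frame(
  Area_total = c(13, 8, 30, 10, 44, 6, 7),
  Biom = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2,
                          0.022727, 0.5, 0.142857)
)
hi <- data.frame(
  Area_total = c(8, 30, 5, 10, 44, 18, 23),
  Biom = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1,
                           0.022727, 0.333333, 0.130434)
)

# merge() does an inner join by default (all = FALSE) and keeps every
# column; columns present in both tables get the given suffixes.
matched <- merge(ff, hi, by = c("Biom", "Species_name"),
                 suffixes = c("_FF", "_HI"))

# Ratio of far-future to historical exposure for the matched rows.
matched$ratio <- matched$AreaCellSuAreaTotal / matched$AreaCellSuAreaTot_HI
matched
```

Only Species_name/Biom pairs present in both tables survive, so Abrocomo_ben and the unmatched Abrothrix_je and Abrothrix_lo rows are dropped.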
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
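For comparison, here is a rough base R sketch of the same counting logic, assuming the rows are already ordered by year within each i. The helper count_changes() is a made-up name for this example; it drops NAs and counts how often consecutive values differ:

```r
# Count how many times a value differs from the previous non-NA value.
count_changes <- function(x) {
  x <- x[!is.na(x)]   # ignore missing observations
  sum(diff(x) != 0)   # diff() is non-zero wherever the value changes
}

# region and urban for i = 4, copied from the question (1979 to 1996).
region_4 <- c(1, NA, 4, 4, rep(1, 11), NA, 4)
urban_4  <- c(1, NA, rep(1, 13), NA, 1)
count_changes(region_4)
count_changes(urban_4)

# Per-individual counts with tapply(), on the small excerpt for i = 1 and 45.
toy <- data.frame(i = c(1, 1, 1, 1, 45, 45, 45, 45),
                  region = c(1, 1, 2, 2, 3, 3, 4, 1))
changes_by_i <- tapply(toy$region, toy$i, count_changes)
changes_by_i
```

This treats a move across a gap of NAs (1, NA, 4) as one change, which is the same thing the drop_na() step does in the dplyr version.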
I am having an issue with the arulesSequences package in R. I was able to read baskets into the program, and create a data.frame, however it fails to recognize any other items beyond the first column. Below is a sample of my data set, which follows the form demonstrated here: Data Mining Algorithms in R/Sequence Mining/SPADE.
[sequenceID] [eventID] [SIZE] items
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
The dataset was exported from SAS as a [.CSV] then converted to a tab-delimited [.TXT] file in Excel. Headers were removed for import into R, but I placed them in brackets above for clarity in this example. All spaces were replaced with an underscore ("_"), and item names were simplified as much as possible. Each item is listed in a separate column. The following command was used to import the file:
baskets <- read_baskets(con = "...filepath/spade.txt", sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
I am presented with no errors, so I continue with the following command:
as(baskets, "data.frame")
Here, it returns the data.frame as requested, however it fails to capture the items beyond the first column:
items sequenceID eventID SIZE
{OB/Gyn} 2 1 1
{Internal_Medicine} 15 1 1
{Internal_Medicine} 15 2 1
{Internal_Medicine} 15 3 1
{Internal_Medicine} 56 1 2
{Oncology} 84 1 1
{Hematology} 151 1 2
{Hematology/Oncology} 151 2 1
{Hematology/Oncology} 151 3 1
{Gastroenterology} 185 1 2
Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
I have tried importing the file directly as a [.CSV], but the data.frame results in a similar format to my above attempt using tabs, except it places a comma in front of the first item:
{,Internal_Medicine} 56 1 2
Any troubleshooting suggestions would be greatly appreciated. It seems like this package is picky when it comes to formatting.
Regarding "Line 5 should look like":
{Internal_Medicine, Neurology} 56 1 2
Check out the following, which returns that row as expected with these package versions:
library(arulesSequences)
packageVersion("arulesSequences")
# [1] ‘0.2.16’
packageVersion("arules")
# [1] ‘1.5.0’
txt <- readLines(n=10)
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
writeLines(txt, tf<-tempfile())
baskets <- read_baskets(con = tf, sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
as(baskets, "data.frame")
# items sequenceID eventID SIZE
# 1 {OB/Gyn} 2 1 1
# 2 {Internal_Medicine} 15 1 1
# 3 {Internal_Medicine} 15 2 1
# 4 {Internal_Medicine} 15 3 1
# 5 {Internal_Medicine,Neurology} 56 1 2 # <----------
# 6 {Oncology} 84 1 1
# 7 {Hematology} 151 1 2
# 8 {Hematology/Oncology} 151 2 1
# 9 {Hematology/Oncology} 151 3 1
# 10 {Gastroenterology} 185 1 2
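If the import still drops items on your machine, one thing worth checking (a hedged suggestion, not something the output above establishes) is how the separator pattern splits a raw line of your file. The sketch below uses a made-up line with a tab between the two items and base R's strsplit() with the same pattern that is passed as sep:

```r
# A sample line with a tab between the two items, as in a tab-delimited export.
line <- "56 1 2 Internal_Medicine\tNeurology"

# Split on runs of spaces or tabs, the same pattern used for `sep` above.
fields <- strsplit(line, "[ \t]+")[[1]]

# The first three fields are the info columns; the rest are the items.
info  <- fields[1:3]
items <- fields[-(1:3)]
items
```

Running the same split on lines obtained with readLines() from the real file would show whether stray characters or unexpected delimiters are getting in the way.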
I have data on a large number of individuals and there may be multiple observations per person. I want to deduplicate the data into 'episodes' of 28 days for each individual. I want to drop those records whose observation date is 28 days or less after the start date of the prior episode.
Some sample data on 6 observations of a single individual are below. The duplicate and new_episode variables are dummy variables and are not present in the original data and indicate the logic of the example.
dat <- data.frame(id = rep(1, 6), spec_n = seq(1,6,1),
spec_date = as.Date(c("2016/01/01", "2016/01/02", "2016/01/30",
"2016/01/31", "2016/02/02", "2016/02/28")),
duplicate = c(0,1,0,1,1,0), new_episode = c(1,0,1,0,0,1),
stringsAsFactors = FALSE)
dat
id spec_n spec_date duplicate new_episode
1 1 1 2016-01-01 0 1
2 1 2 2016-01-02 1 0
3 1 3 2016-01-30 0 1
4 1 4 2016-01-31 1 0
5 1 5 2016-02-02 1 0
6 1 6 2016-02-28 0 1
With dplyr I can calculate the time since the last observation and the time since the first episode. So deduplicating on date_diff would not provide the data I require.
library(dplyr)
dat <- dat %>% group_by(id) %>%
mutate(date_diff = spec_date - lag(spec_date),
earliest_spec_date = min(spec_date),
diff_earliest = spec_date - earliest_spec_date)
dat
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 2 2016-01-02 1 0 1 days 2016-01-01 1 days
3 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
4 1 4 2016-01-31 1 0 1 days 2016-01-01 30 days
5 1 5 2016-02-02 1 0 2 days 2016-01-01 32 days
6 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days
However, this does not quite provide what I need. spec_n == 6 is less than 28 days since the previous observation, but more than 28 days since the start of the last episode (spec_n == 3).
Expected output would be those rows where duplicate is 0 or new_episode is 1, e.g.
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
3 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days
This should work (it's an implementation of the idea Llopis suggested, I think).
I make some simulated data first:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by=1), data=rnorm(365))
head(df)
date data
1 2015-01-01 -1.4493544
2 2015-01-02 -0.8860342
3 2015-01-03 1.3629541
4 2015-01-04 -2.0131108
5 2015-01-05 -0.4527413
6 2015-01-06 0.8428585
Now we write a function that takes the first date and checks whether each subsequent date is at least 28 days after it, returning 1 if it is and 0 if it is not. When a date is at least 28 days after the current start, that date becomes the new basis for future comparisons.
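dupFinder <- function(x) {
  n <- 1
  N <- length(x)
  res <- rep(1, N)   # the first date always starts an episode
  start <- x[n]      # start date of the current episode
  while (n < N) {
    if (as.numeric(x[n+1] - start) >= 28) {
      # at least 28 days after the episode start: new episode
      res[n+1] <- 1
      n <- n + 1
      start <- x[n]
    } else {
      # still within the current episode: flag as duplicate
      res[n+1] <- 0
      n <- n + 1
    }
  }
  return(res)
}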
The function dupFinder will return a vector of length equal to that of your dataframe, and you can then use it to subset the dataframe to the rows of interest. Thus:
df[dupFinder(df$date)==1,]
date data
1 2015-01-01 -1.4493544
29 2015-01-29 0.2084123
57 2015-02-26 1.4541566
85 2015-03-26 0.6794230
113 2015-04-23 -0.8285670
141 2015-05-21 -0.8686872
169 2015-06-18 2.1657994
197 2015-07-16 -1.1802231
225 2015-08-13 0.1808395
253 2015-09-10 -0.4762835
281 2015-10-08 -0.3769593
309 2015-11-05 0.2825544
337 2015-12-03 -0.7132649
365 2015-12-31 -1.8111226
As expected, we start with January 1, then January 29, then February 26; since February has 28 days, we next get March 26, and so on.
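To apply the same idea per individual, as the question needs, something along these lines might work. It is only a sketch: flag_episode_starts() and its gap argument are names invented here, the dates are assumed to be sorted within each id, and the >= comparison follows dupFinder above (use > instead if an observation exactly 28 days after the start should still count as a duplicate).

```r
# Flag the first observation of each episode (1) versus duplicates (0).
# `days` holds dates as numbers of days, sorted ascending.
flag_episode_starts <- function(days, gap = 28) {
  is_start <- numeric(length(days))
  if (length(days) == 0) return(is_start)
  start <- days[1]
  is_start[1] <- 1
  for (k in seq_along(days)[-1]) {
    if (days[k] - start >= gap) {
      is_start[k] <- 1
      start <- days[k]   # this observation opens a new episode
    }
  }
  is_start
}

# Sample data from the question (one individual, six observations).
dat <- data.frame(
  id = rep(1, 6),
  spec_date = as.Date(c("2016-01-01", "2016-01-02", "2016-01-30",
                        "2016-01-31", "2016-02-02", "2016-02-28"))
)

# ave() applies the function separately within each id.
dat$new_episode <- ave(as.numeric(dat$spec_date), dat$id,
                       FUN = flag_episode_starts)
dat[dat$new_episode == 1, ]
```

On the sample data this keeps the observations from 2016-01-01, 2016-01-30 and 2016-02-28, matching the new_episode column in the question.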
I have an R script that allows me to select a sample size and take fifty individual random samples with replacement. Below is an example of this code:
## Creates data frame
df = as.data.table(data)
## Select sample size
sample.size = 5
## Creates Sample 1 (Size 5)
Sample.1<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.1$Sample <- c("01")
According to the R script above, I first created a data frame. I then select my sample size, which in this case is 5. This represents just one sample. Due to my lack of experience with R, I repeat this code 49 more times. The last piece of code looks like this:
## Creates Sample 50 (Size 5)
Sample.50<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.50$Sample <- c("50")
The sample output would look something like this (Sample Range 1 - 50):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 8800 50
1 3800 50
1 10400 50
1 2200 50
1 29000 50
It should be noted that the variable 'Num' was created for grouping purposes and has little to no influence on my overall question (which is posted below).
Instead of repeating this code fifty times, to get me fifty individual samples (with a size of 5), is there a loop I can create to help me limit my code? I have been recently asked to create ten thousand random samples, each of a size of 5. I obviously cannot repeat this code ten thousand times so I need some sort of loop.
A sample of my final output should look something like this (Sample Range 1 - 10,000):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 9900 10000
1 8300 10000
1 10700 10000
1 6800 10000
1 31000 10000
Thank you all in advance for your help, it's greatly appreciated.
Here is some sample data if needed:
Num Dollars
1 31002
1 13728
1 23526
1 80068
1 86244
1 9330
1 27169
1 13694
1 4781
1 9742
1 20060
1 35230
1 15546
1 7618
1 21604
1 8738
1 5299
1 12081
1 7652
1 16779
A very simple method would be to use a for loop and store the results in a list:
lst <- list()
for(i in seq_len(3)){
lst[[i]] <- df[sample(seq_len(nrow(df)), 5, replace = TRUE),]
lst[[i]]["Sample"] <- i
}
> lst
[[1]]
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
[[2]]
Num Dollars Sample
9 1 4781 2
13 1 15546 2
12 1 35230 2
17 1 5299 2
12.1 1 35230 2
[[3]]
Num Dollars Sample
1 1 31002 3
7 1 27169 3
17 1 5299 3
5 1 86244 3
6 1 9330 3
Then, to create a single data.frame, use do.call to rbind the list elements together:
do.call(rbind, lst)
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
9 1 4781 2
13 1 15546 2
121 1 35230 2
17 1 5299 2
12.1 1 35230 2
11 1 31002 3
7 1 27169 3
171 1 5299 3
5 1 86244 3
6 1 9330 3
It's worth noting that if you're sampling with replacement, then drawing 50 (or 10,000) samples of size 5 is equivalent to drawing one sample of size 250 (or 50,000). Thus I would do it like this (you'll see I stole a line from @beginneR's answer):
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
n.samples = 10000
# Sample and assign groups
draws <- df[sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE), ]
draws[, Sample := rep(1:n.samples, each = sample.size)]
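If you don't have data.table at hand, the same single-draw idea can be sketched in base R. This is an illustration only: the data frame is rebuilt from the sample data in the question, n.samples is set to 50 to keep it small, and set.seed() is there just to make the draw repeatable.

```r
set.seed(42)  # only so the example is reproducible

# Sample data from the question.
df <- data.frame(
  Num = 1,
  Dollars = c(31002, 13728, 23526, 80068, 86244, 9330, 27169, 13694,
              4781, 9742, 20060, 35230, 15546, 7618, 21604, 8738,
              5299, 12081, 7652, 16779)
)

sample.size <- 5
n.samples <- 50

# One big draw with replacement, then cut into consecutive blocks of 5.
idx <- sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE)
draws <- df[idx, ]
draws$Sample <- rep(seq_len(n.samples), each = sample.size)

# Every sample label should appear exactly sample.size times.
table(table(draws$Sample))
```

Changing n.samples to 10000 gives the ten thousand samples asked for, with no loop.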