R: Function to repeat counting words in strings with different arguments

I am summing the word counts of strings that match a specific argument, e.g. the week number (week 1 = 1, week 2 = 2, and so on), with the following command:
sum(data[which(data[,17]==1), 19])
data[,17] is the column of the data frame holding the numeric week, which has to be 1 for week 1.
data[,19] is the column of the data frame holding the number of words in each string.
I have 31 weeks and 228,000 strings, and I do not want to run the command for each week separately, so I am looking for a function that does this automatically for weeks 1-31 and returns the results.
Thanks for helping!
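One way to get all the weekly sums in a single call is to group the word counts by week. The sketch below is a minimal illustration on a small made-up data frame; week_col and word_col are placeholder names that would be 17 and 19 for the data described above.

```r
# Made-up stand-in for the real data frame (assumption: the real one has
# the week in column 17 and the word count in column 19).
data <- data.frame(week  = c(1, 1, 2, 3, 3, 3),
                   words = c(5, 7, 2, 4, 1, 6))
week_col <- 1   # would be 17 for the data in the question
word_col <- 2   # would be 19 for the data in the question

# tapply() splits the word counts by week and sums each group.
totals <- tapply(data[, word_col], data[, week_col], sum)

# The same result as an explicit repetition of the original command,
# once per week (use 1:31 for the full data set).
totals_loop <- sapply(1:3, function(w) {
  sum(data[which(data[, week_col] == w), word_col])
})

print(totals)
print(totals_loop)
```

tapply() only reports weeks that actually occur in the data, whereas the sapply() version returns 0 for a week with no matching rows.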

Related

How to calculate the difference in months between two dates in TOSCA?

Date 1: 10/25/2020
Date 2: 01/25/2021
Difference = 3
What is the formula to find the difference between two dates in months in TOSCA?
WEEKDAY :
To find the day of the week of a given date
{CALC[WEEKDAY(DATE(2018,3,18))]}
Expected result: For the date above, the result is 1 (Sunday).
Sunday=1, Monday=2, Tuesday=3,
Wednesday=4, Thursday=5, Friday=6, Saturday=7
DATEDIF :
To find the difference between two given dates.
{CALC[DATEDIF(DATE(2018,2,10),DATE(2018,2,21), """d""")]}
Expected result: The expression above returns 11 (the difference between the two dates is 11 days).
IF :
Given two dates, determine whether the first is later than the second.
{CALC[IF(DATE(2018,1,3)>DATE(2018,3,24),"""True""","""False""")]}
Expected result: This expression returns False, because the first date is earlier than the second one.
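The expected results above, and the month difference asked about in the question, can be cross-checked outside TOSCA. The following is a sketch in base R, not TOSCA syntax. The helper months_between is a hypothetical name, and its definition (whole calendar months, counting a month only once the day of the month has been reached) is an assumption about what "difference in months" should mean.

```r
# Cross-check of the TOSCA examples in base R (not TOSCA syntax).

# WEEKDAY: POSIXlt stores wday as 0 = Sunday ... 6 = Saturday,
# so add 1 to match the 1 = Sunday convention used above.
weekday <- as.POSIXlt(as.Date("2018-03-18"))$wday + 1

# DATEDIF in days: subtracting two Dates gives a difference in days.
diff_days <- as.numeric(as.Date("2018-02-21") - as.Date("2018-02-10"))

# IF: Dates compare chronologically.
is_later <- as.Date("2018-01-03") > as.Date("2018-03-24")

# Difference in whole months (assumed definition, hypothetical helper).
months_between <- function(d1, d2) {
  a <- as.POSIXlt(d1)
  b <- as.POSIXlt(d2)
  (b$year - a$year) * 12 + (b$mon - a$mon) - as.integer(b$mday < a$mday)
}
diff_months <- months_between(as.Date("2020-10-25"), as.Date("2021-01-25"))

print(c(weekday = weekday, diff_days = diff_days, diff_months = diff_months))
print(is_later)
```

For the two dates in the question, 10/25/2020 and 01/25/2021, the arithmetic is (2021 - 2020) * 12 + (1 - 10) = 3.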

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with a missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to last day of that month). For example, for id 1 the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Note that the number of days varies by month, e.g. 31 days for October and 30 days for April.
For id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
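This isn't the prettiest, but it seems to work and adapts to differing month and year lengths. It samples uniformly over the whole month or year rather than only from its middle:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)      # day is missing
selm <- grepl("^../99/", df$date)   # month is missing
md <- seld & (!selm)                # only the day is missing
mm <- seld & selm                   # day and month are missing
# replace only the 99 placeholders with 01, leaving any "99" inside the year intact
df$date <- sub("^99/", "01/", as.character(df$date))
df$date <- sub("^(../)99/", "\\101/", df$date)
df$date <- as.Date(df$date, format="%d/%m/%Y")
# number of days in each month, then a random offset of 0..(days - 1) from the first day
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1) - 1
# same idea for the year
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1) - 1
#df
# (the sampled days depend on the seed and on the R version)
#  id       date   dateorig
#1  1 2014-10-13 99/10/2014
#2  2 2011-02-04 99/99/2011
#3  3 2016-02-23 23/02/2016
#4  4       <NA>         NA
#5  5 2009-04-18 99/04/2009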
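The question asks for a draw from "the middle" of each interval, which can be read in more than one way. As one possible reading, the sketch below computes the first day, last day and midpoint of a month or year, and then draws a date within a few days of that midpoint. The function names and the half_width parameter are hypothetical, introduced only for illustration.

```r
# Interval covered by a month or a year, starting at `first`
# (assumption: `first` is the first day of that month or year).
interval_for <- function(first, unit = c("month", "year")) {
  unit <- match.arg(unit)
  next_start <- seq(first, length.out = 2, by = unit)[2]
  last <- next_start - 1
  mid  <- first + floor(as.numeric(last - first) / 2)
  list(first = first, last = last, mid = mid)
}

# Random date within `half_width` days of the midpoint (hypothetical rule).
sample_middle <- function(iv, half_width = 3) {
  iv$mid + sample(-half_width:half_width, 1)
}

oct  <- interval_for(as.Date("2014-10-01"), "month")
apr  <- interval_for(as.Date("2009-04-01"), "month")
yr11 <- interval_for(as.Date("2011-01-01"), "year")

set.seed(1)
draw <- sample_middle(oct)
print(oct$last)   # last day of October 2014
print(draw)       # a date close to the middle of October 2014
```

Because the last day is derived from the start of the next month or year, the differing month lengths are handled without a lookup table.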

Printing row names for values in a matrix

I'm having problems printing the row names for specific values within a matrix. The following two questions have been difficult.
On which day(s) did she arrive the fastest in the first week? (Only the day(s) of the week should print. Hint: Use the row names.)
Determine the day(s) of the second week on which she arrived at work within half an hour. (Only the day(s) of the week should print.)
This is the data set, called commutes:
          Week1 Week2
Monday       26    22
Tuesday      35    23
Wednesday    24    36
Thursday     31    32
Friday       34    25
1) You can use the which() function to find the index (or indices, if there are ties) of the smallest value in the first column. You pass which() a logical vector (in this case, a vectorized equality test). Supposing your matrix is bound to m:
ind = which(m[,'Week1'] == min(m[,'Week1']))
You can then use the index to get the matching row name with rownames():
day = rownames(m)[ind]
2) This is essentially the same thing, except that you should expect a vector of indices rather than a single index. Again use which() to find the indices that match the desired logical expression. Because m is a matrix, select the column with m[,'Week2'] rather than m$Week2, which only works on data frames and lists:
inds = which(m[,'Week2'] < 30)
days = rownames(m)[inds]
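For reference, here is a self-contained sketch that rebuilds the commutes matrix from the question and applies both steps. It assumes "within half an hour" means strictly less than 30 minutes; use <= 30 if a commute of exactly 30 minutes should count.

```r
# Rebuild the matrix from the question (filled column by column).
commutes <- matrix(
  c(26, 35, 24, 31, 34,    # Week1
    22, 23, 36, 32, 25),   # Week2
  ncol = 2,
  dimnames = list(
    c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
    c("Week1", "Week2")
  )
)

# 1) Day(s) with the shortest commute in week 1.
fastest_week1 <- rownames(commutes)[
  which(commutes[, "Week1"] == min(commutes[, "Week1"]))
]

# 2) Day(s) in week 2 with a commute under 30 minutes (assumed threshold).
under_30_week2 <- rownames(commutes)[which(commutes[, "Week2"] < 30)]

print(fastest_week1)
print(under_30_week2)
```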

Nested loop in R not giving expected outputs

I wanted to use the nested loop below to work out a variable 'data' for every day within a number of years.
x is a vector of length 20 (number of years) and each of the 20 entries is the number of days the inner loop is to run for.
I also have a vector 'start' that has 20 dates in the format "1981-02-01".
I wanted to create a matrix of the output (data) that would have the data for each day in rows and then one column per year.
The code I am using below however does not seem to be updating the counters (yrcntr and daycntr) which is causing the whole thing to not work.
Also, when I try to assign values to 'data' within the loop using the counters as indices (data[daycntr yrcntr]), it's not working.
I'm not even getting an error.
I'm not sure how to write out the format of 'data' used below here, but I'll give it a go:
datamat=
tmax tmin date
11 4 "1981-03-31"
13 6 "1981-04-01"
12 7 "1981-04-02"
and 'start' is a vector of dates in the format: "1981-04-02" "1981-04-03"
tmax<-datamat[,1]
tmin<-datamat[,2]
tdates<-datamat[,3]
yrcntr<-0;
daycntr<-0;
for (yr in 1:length(x)){
yrcntr<-yrcntr+1
#find the row in the temp data that matches the startdate each year
tempidx<- (which(tdates==start[yrcntr]))-1
for (days in 1:numdays[yr]){
daycntr<-daycntr+1
dlytempidx=tempidx+1
data[daycntr yrcntr]<- (tmax[dlytempidx]+tmin[dlytempidx])
}
rm(tempidx)
}
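A few things in the loop above look like likely causes: the index data[daycntr yrcntr] has no comma between row and column, daycntr is never reset at the start of a new year, dlytempidx is always tempidx+1 so it never advances with the day, and the text describes the day counts as x while the loop reads numdays. The sketch below is one way the loop could be written under those assumptions; the toy values in datamat, start and numdays are made up, and result is a hypothetical name for the output matrix.

```r
# Made-up temperature data in the layout described in the question.
datamat <- data.frame(
  tmax = c(11, 13, 12, 15, 10, 16),
  tmin = c(4, 6, 7, 5, 3, 8),
  date = c("1981-03-31", "1981-04-01", "1981-04-02",
           "1982-03-31", "1982-04-01", "1982-04-02"),
  stringsAsFactors = FALSE
)
start   <- c("1981-04-01", "1982-03-31")  # one start date per year
numdays <- c(2, 3)                        # days to process per year

# One row per day, one column per year; unused cells stay NA.
result <- matrix(NA_real_, nrow = max(numdays), ncol = length(start))

for (yr in seq_along(start)) {
  # Row of the temperature data that matches this year's start date.
  first_row <- which(datamat$date == start[yr])
  for (day in seq_len(numdays[yr])) {
    idx <- first_row + day - 1            # advances with each day
    result[day, yr] <- datamat$tmax[idx] + datamat$tmin[idx]
  }
}

print(result)
```

Using the loop variables yr and day directly as the matrix indices removes the need for separate counters, so there is nothing to reset between years.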

Handling SPELL data with exact dates (feature request?)

I'm learning TraMineR and have used different types of longitudinal data. My original data is SPELL data with id, start time, end time and status, where the start and end times are exact dates, so my subsequences have varying lengths.
With seqformat() I can chop the data (automatically) into 1-year pieces and convert it into STS format, e.g. where the first variable is the first date, the second variable is the first date + 1 year, and so on.
What I would like to do is adjust the conversion so that I could use half year or one month time periods.
Here I have converted the dates into years with decimals with decimal.date():
id start end status
1 1 1965.138 1965.974 1
2 1 1968.714 1987.237 1
3 1 1985.667 2003.933 2
4 1 1988.499 1988.665 1
5 1 1996.652 1996.878 1
The sequence object that is created automatically has the data in one year subsequences:
$ y1960.16803278689
$ y1961.16803278689
$ y1962.16803278689
$ y1963.16803278689
So for data with exact dates I would like to have the option to also use subsequence lengths shorter than 1 year. I understand that seqgranularity() does the opposite, i.e. aggregates to longer periods.
Alternatively, I'm interested to know if there's some way in R outside TraMineR to handle the SPELL data so as to create subsequences of a given length.
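As a rough illustration of the second option, the sketch below expands spell data into a state-per-period matrix in base R, without using TraMineR. The data are made up, step is a hypothetical parameter (0.5 for half years), and the rules are assumptions: a spell covers a grid point when start <= point < end, and later rows overwrite earlier ones if spells overlap.

```r
# Made-up spell data with decimal-year start and end times.
spells <- data.frame(
  id     = c(1, 1, 2),
  start  = c(2000.00, 2001.25, 2000.50),
  end    = c(2001.25, 2002.00, 2001.50),
  status = c(1, 2, 1)
)

step <- 0.5   # length of one period in years (assumption: half years)
grid <- seq(floor(min(spells$start)), max(spells$end), by = step)

ids <- unique(spells$id)
sts <- matrix(NA_real_, nrow = length(ids), ncol = length(grid),
              dimnames = list(as.character(ids), paste0("t", grid)))

for (i in seq_len(nrow(spells))) {
  # Grid points that fall inside spell i (start inclusive, end exclusive).
  covered <- grid >= spells$start[i] & grid < spells$end[i]
  sts[as.character(spells$id[i]), covered] <- spells$status[i]
}

print(sts)
```

The resulting matrix has one row per id and one column per period, which is the wide layout that a state-sequence (STS) object is usually built from. For monthly periods a step of 1/12 is not exactly representable in floating point, so working with integer month counts or Date objects may be safer there.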
