Merge rows which have the same date within a data frame [closed] - r

I have a data.frame as follows:
timestamp index negative positive sentiment
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2015-10-29 15:00:10 0 11 10 -1
2 2015-10-29 17:26:48 0 1 5 4
3 2015-10-29 17:30:07 0 10 22 12
4 2015-10-29 20:13:22 0 5 6 1
5 2015-10-30 14:25:26 0 3 2 -1
6 2015-10-30 18:22:30 0 14 15 1
7 2015-10-31 14:16:00 0 10 23 13
8 2015-11-02 20:30:18 0 14 7 -7
9 2015-11-03 14:15:00 0 8 26 18
10 2015-11-03 16:52:30 0 12 34 22
I would like to know if there is a way to merge rows that fall on the same day so that I get one score per day. I have absolutely no clue how to approach this, because the times differ within each day and I don't know how to extract the date part and write a function that merges only equal dates. I would like to obtain a data.frame of the following form:
timestamp index negative positive sentiment
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2015-10-29 0 27 43 16
2 2015-10-30 0 17 17 0
3 2015-10-31 0 10 23 13
4 2015-11-02 0 14 7 -7
5 2015-11-03 0 20 60 40
Is there any way to arrive at this result? I would be thankful for any hint.

You can use aggregate() to do this. Before aggregating, you need to truncate each timestamp to its date so that rows from the same day fall into the same group; as.Date() drops the time-of-day component.
I will assume you have your data stored as df:
aggregate(df[, 2:5], by = list(timestamp = as.Date(df$timestamp)), FUN = sum)
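If you prefer dplyr, a minimal sketch of the same aggregation (assuming dplyr >= 1.0 and the data frame df above):
library(dplyr)

df %>%
  group_by(timestamp = as.Date(timestamp)) %>%                      # collapse timestamps to whole days
  summarise(across(c(index, negative, positive, sentiment), sum))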

Related

Merge datasets using common and uncommon/time-varying variables

I am trying to merge multiple datasets based on subject IDs and dates of measurement. The subject IDs are common across datasets (though some datasets may contain subject IDs that others do not). The dates of measurement are not shared, i.e., they differ between datasets. I am trying to match entries for the same subjects between datasets that were recorded within 14 days of each other.
Here is an example of a pair of datasets I am trying to merge. I assume that I need to match two datasets at a time, as is customary in data matching functions such as merge in Stata or R.
Dataset 1:
ID t x1 x2 x3
1 01/01/2019 7957 0 1
1 31/01/2019 6991 0 1
1 02/03/2019 4242 0 1
1 26/03/2019 9459 0 1
1 30/03/2019 5584 0 1
2 04/02/2020 9142 1 3
2 29/02/2020 8208 1 3
2 12/03/2020 9260 1 3
3 12/03/2019 8919 1 2
3 25/03/2019 4694 1 2
3 16/04/2019 1393 1 2
4 25/03/2020 . 0 0
4 22/04/2020 . 0 0
5 02/04/2018 7537 1 1
5 29/04/2018 9172 1 1
5 19/05/2018 4914 1 1
5 22/06/2018 846 1 1
6 06/04/2020 3493 1 5
6 29/04/2020 9894 1 5
6 22/05/2020 7034 1 5
7 02/01/2022 8142 0 2
7 07/02/2022 7891 0 2
Dataset 2:
ID t y1 x4
1 16/01/2019 22 0
1 01/02/2019 16 0
1 06/03/2019 18 0
1 29/03/2019 13 0
2 17/03/2020 22 1
4 06/04/2020 17 0
4 14/05/2020 15 0
4 17/05/2020 23 0
4 22/05/2020 19 0
4 24/05/2020 16 0
5 10/03/2018 . .
5 17/04/2018 . .
5 14/05/2018 . .
5 07/06/2018 . .
6 06/04/2020 12 1
6 22/05/2020 15 1
7 22/01/2022 24 0
7 09/03/2022 27 0
8 22/02/2021 11 .
8 24/02/2021 14 .
8 28/02/2021 16 .
The merged dataset:
ID t1 t2 tdiff x1 x2 x3 y1 x4
1 31/01/2019 01/02/2019 -1 6991 0 1 16 0
1 02/03/2019 06/03/2019 -4 4242 0 1 18 0
1 30/03/2019 29/03/2019 1 5584 0 1 13 0
2 12/03/2020 17/03/2020 -5 9260 1 3 22 1
4 25/03/2020 06/04/2020 -12 . . . 17 0
5 29/04/2018 17/04/2018 12 9172 1 1 . .
5 19/05/2018 14/05/2018 5 4914 1 1 . .
6 06/04/2020 06/04/2020 0 3493 1 5 12 1
6 22/05/2020 22/05/2020 0 7034 1 5 15 1
t1 reflects the date of measurement in dataset 1; t2 reflects the date of measurement in dataset 2; tdiff reflects the difference in days between t1 and t2. No value of tdiff should exceed 14 in absolute value. The periods reflect missing values.
As you can see, only entries recorded within +/-14 days of each other for a given subject have been merged, on a 1:1 basis. There is an instance where two entries in dataset 1 fall within 14 days of one entry in dataset 2 for subject 1. In cases like this, I would like to take the pair of entries that are closest in date (e.g., of 26/03/2019 and 30/03/2019 in dataset 1, the latter is closer to 29/03/2019 in dataset 2). There may be more than two entries in one dataset that fall within 14 days of an entry in the other dataset; again, I would keep the pair of entries closest in time. Some subjects are absent from the merged dataset because they do not appear in both datasets (e.g., subject 3 in dataset 1 and subject 8 in dataset 2).
All variables, including the dates of measurement (t), have been carried over from each dataset (x1-x3 from dataset 1; y1 and x4 from dataset 2). Each dataset has a different number of variables to merge. There are 12 datasets to merge in total, which I envision doing in pairs. Also, subjects vary in how many entries they have within a dataset (e.g., subject 1 has 5 entries whereas subject 7 has only 2) and between datasets (e.g., subject 2 has three entries in dataset 1 but only one in dataset 2).
I have had a few thoughts so far but feel lost in how to implement something like this:
The data is currently in long format, but there is no reason why we cannot transpose to wide format. It might be easier if a subject's data were on one row?
Ideally we would have a standardized time variable that could be used to match measurements across datasets. I have thought of creating a variable that reflects the difference between an absolute starting time point and the date of a given entry, and then converting this measure into months, but we still have the problem that the time variable is not the same across datasets.
As for programs for implementation, I am using Stata, R and Excel for data management and analysis.
Your guidance is greatly appreciated!
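One way to sketch this in R, under a few assumptions: d1 and d2 are data frames shaped like the two datasets above, the dates have been parsed with as.Date(t, "%d/%m/%Y"), and dplyr >= 1.1 is available for the many-to-many join. The two slice_min() passes are a greedy approximation of the closest-pair rule, not a globally optimal assignment:
library(dplyr)

# assumes dates are already parsed: d1$t <- as.Date(d1$t, "%d/%m/%Y"), same for d2
matched <- inner_join(
  rename(d1, t1 = t),
  rename(d2, t2 = t),
  by = "ID",
  relationship = "many-to-many"              # all visit pairs within a subject
) %>%
  mutate(tdiff = as.numeric(t1 - t2)) %>%    # difference in days
  filter(abs(tdiff) <= 14) %>%               # keep only pairs within 14 days
  group_by(ID, t1) %>%
  slice_min(abs(tdiff), n = 1, with_ties = FALSE) %>%  # closest dataset-2 visit per dataset-1 visit
  group_by(ID, t2) %>%
  slice_min(abs(tdiff), n = 1, with_ties = FALSE) %>%  # enforce 1:1 in the other direction
  ungroup()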

Creating a percentage column based on the sums of a column grouped by a different column? [duplicate]

This question already has answers here: Summarizing by subgroup percentage in R (2 answers). Closed 9 months ago.
I am wrangling a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes but find it a struggle, as the R documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground cover. I want to create a new column PROP-COVER which gives each species' cover as a percentage of the total cover of all species at a particular site. This is slightly different from calculating percentage cover by site area, as it disregards bare ground with no vegetation. The calculation is easy for a single site, but I have over a hundred sites and need to perform it on species ground cover grouped by site. The desired output column is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat, but I can't see where the arguments should go; every attempt I make returns NULL.
Below is an example of something I tried. I am sure it is totally wide of the mark, but I can't work out where to begin or whether this is even possible.
a <- for (i in data1$COVER) {
  sum(data1$COVER[data1$SITE == "i"], na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops work; no amount of reading about them seems to help. Perhaps what I am trying to do isn't even possible? :(
Many thanks for looking.
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
df %>%
  group_by(SITE) %>%
  mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) or even n = prop.table(COVER)
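As an aside on the original loop attempt: for() always returns NULL in R, which is why the assignment to a yields NULL, and SITE == "i" compares against the literal string "i" rather than the loop variable. A loop-free base-R sketch using ave(), assuming the question's data frame is named df:
# per-site percentage without any explicit loop
df$PROP_COVER <- ave(df$COVER, df$SITE,
                     FUN = function(x) 100 * x / sum(x, na.rm = TRUE))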

Get the average of the values of one column for the values in another

I was not sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are also mentioned. My data frame (df) has many mentions of each of 70 initiatives (rows), meaning it has 500+ rows of data but only 70 initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean (average) Tone when an Initiative is mentioned, as well as the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
  group_by(Initiative, Topic, Goals, Tone) %>%
  summarize(averagetone = mean(Tone))
with the following output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that for Initiative, the value 0 means "other initiative".
I have also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with this output:
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances I do not get an average for Tone but instead get NAs.
I have removed the NAs from the Tone column of the df and have also tried removing all the other missing values (only about 30 values deleted).
and I have also re-coded the values for Tone :
tabmean <- Meantable %>% mutate(Tone = recode(Tone,
  `1` = "1",
  `2` = "0",
  `3` = "-1",
  `4` = "2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than I think, but I have gotten stuck and have no idea how to proceed.
I'd be super grateful for a better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say you want the average tone for initiative 1; you could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to add na.rm = TRUE to the summarise() call if there are missing values in the column you are summarizing, otherwise it will only produce NAs, and (2) check that the columns are numeric: you can inspect them with str(tabmean) and, for example, convert Tone with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone)).
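For the broader goal of one average per initiative, a minimal sketch (assuming Tone has been converted to numeric) is to group by Initiative only; grouping by Tone as well, as in the original attempt, just returns each Tone value as its own mean:
library(dplyr)

tabmean %>%
  group_by(Initiative) %>%                            # one group per initiative
  summarise(averagetone = mean(Tone, na.rm = TRUE))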

How to create a new date (month, year) column in R

I have a very simple question and I hope you can help me.
I have a dataset with monthly temperatures from 1958 to 2020. This gives me a total of 756 observations, which matches the number of months.
This is the only column I have, and I would like to add a column with the date in month-year format, starting from 01-1958 for the first observation, then 02-1958, 03-1958, ... 12-2020.
Any ideas?
Thank you very much!
Two things:
I think a Date object would be much better (there is no Month object), since it has natural number-like properties that allow you to find differences, plot without bias, etc. Note that, stored this way, every other representation can be derived trivially for reports/renders.
Even if you must go with a string, I suggest putting the year first so that sorting works as expected.
You offered no data, so I'll make something up:
mydata <- data.frame(val = 1:756)
mydata$date <- seq(as.Date("1958-01-01"), length.out=756, by="month")
mydata$ym_chr <- format(mydata$date, format = "%Y-%m")
mydata$my_chr <- format(mydata$date, format = "%m-%Y")
mydata[c(1:5, 752:756),]
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 752 752 2020-08-01 2020-08 08-2020
# 753 753 2020-09-01 2020-09 09-2020
# 754 754 2020-10-01 2020-10 10-2020
# 755 755 2020-11-01 2020-11 11-2020
# 756 756 2020-12-01 2020-12 12-2020
As a quick demonstration that we are looking at exactly one observation per month (no more, no fewer), across all months and all years, here's a quick table:
table(year=gsub(".*-", "", mydata$my_chr), month=gsub("-.*", "", mydata$my_chr))
# month
# year 01 02 03 04 05 06 07 08 09 10 11 12
# 1958 1 1 1 1 1 1 1 1 1 1 1 1
# 1959 1 1 1 1 1 1 1 1 1 1 1 1
# 1960 1 1 1 1 1 1 1 1 1 1 1 1
# ...
# 2018 1 1 1 1 1 1 1 1 1 1 1 1
# 2019 1 1 1 1 1 1 1 1 1 1 1 1
# 2020 1 1 1 1 1 1 1 1 1 1 1 1
All snipped rows are identical in all but the year, i.e., all 1s. The sum(.) of this is 756. (Just checking since I wanted to make sure I was doing it right.)
Lastly, to highlight my comment about sorting, here are some examples premised on the knowledge that val is incrementing from 1.
head(mydata[order(mydata$ym_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 6 6 1958-06-01 1958-06 06-1958
head(mydata[order(mydata$my_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 13 13 1959-01-01 1959-01 01-1959
# 25 25 1960-01-01 1960-01 01-1960
# 37 37 1961-01-01 1961-01 01-1961
# 49 49 1962-01-01 1962-01 01-1962
# 61 61 1963-01-01 1963-01 01-1963
If being able to sort by date is important, then I suggest it will be much simpler to use either $date or the string $ym_chr.
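If a true month type is wanted rather than a string, the yearmon class from the third-party zoo package is a common alternative (an assumption on my part, not part of the original answer):
library(zoo)

# yearmon sorts and subtracts like a number and prints like "Jan 1958"
mydata$ym <- as.yearmon(mydata$date)
head(mydata$ym)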

Creating a summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month, or even just sampled purchases in a date range I choose.
The main problem is that certain customers might buy once a month while others come daily, so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date, and then I only get a small percentage of all customers, or I choose a range and get multiple observations for certain customers.
(In this case I wouldn't mind keeping just the earliest observation.)
An important note: I know how to write a for loop to solve this problem, but since the dataset has over 4 million observations that isn't practical; it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = 1:n()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))
Where dd is your sample data frame above.
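If the exact column names from the desired output (1Date, 1Sum, ...) matter, the names_glue argument of pivot_wider() can control the naming; a sketch, assuming tidyr >= 1.1:
library(dplyr)
library(tidyr)

dd %>%
  group_by(ID) %>%
  mutate(seq = 1:n()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum),
              names_glue = "{seq}{.value}")   # yields column names 1Date, 1Sum, 2Date, 2Sum, ...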
