Merge function produces more rows than original dataframe - r

I have two data frames that look like this.
First one:
head(df_2015_2016)
Date HomeTeam AwayTeam B365H B365D B365A BWH BWD BWA IWH IWD IWA
1 08/08/15 Bournemouth Aston Villa 2.00 3.6 4.00 2.00 3.30 3.70 2.10 3.3 3.30
2 08/08/15 Chelsea Swansea 1.36 5.0 11.00 1.40 4.75 9.00 1.33 4.8 8.30
3 08/08/15 Everton Watford 1.70 3.9 5.50 1.70 3.50 5.00 1.70 3.6 4.70
4 08/08/15 Leicester Sunderland 1.95 3.5 4.33 2.00 3.30 3.75 2.00 3.3 3.60
5 08/08/15 Man United Tottenham 1.65 4.0 6.00 1.65 4.00 5.50 1.65 3.6 5.10
6 08/08/15 Norwich Crystal Palace 2.55 3.3 3.00 2.60 3.20 2.70 2.40 3.2 2.85
And the second one:
> head(df_matches)
row_names ID scoresway_id club club_bet city
1 1 242 214 Gent Gent Gent
2 2 248 215 Anderlecht Anderlecht Bruxelles
3 3 243 217 Cercle Brugge Cercle Brugge Brugge
4 4 310 218 Sporting Charleroi Charleroi Charleroi
5 5 249 219 Club Brugge Club Brugge Brugge
6 6 234 222 Beerschot #N/B Antwerp
Now I would like to merge them. The data frame that I start from, df_2015_2016, has 5062 rows:
nrow(df_2015_2016)
[1] 5062
However, when I try to merge them:
df <- merge(df_2015_2016, df_matches, by.x = "HomeTeam", by.y = "club_bet", all.x = T)
the end result has 5733 rows:
nrow(df)
[1] 5733
The output that I want is just 5062 rows, with the matched values, or NA in case there is no match.
Any feedback on what goes wrong here?
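A likely cause, sketched below: merge() returns one output row for every matching pair of keys, so if any team name appears more than once in df_matches$club_bet, each of its matches in df_2015_2016 is duplicated. This is a minimal sketch, assuming the column names shown above, for checking for duplicate keys and keeping one row per club before merging:
# Which club_bet values occur more than once? Each duplicate
# multiplies the matching rows of df_2015_2016 in the merge.
dup_counts <- table(df_matches$club_bet)
dup_counts[dup_counts > 1]
# Keep the first row per club_bet, then merge as before.
df_matches_unique <- df_matches[!duplicated(df_matches$club_bet), ]
df <- merge(df_2015_2016, df_matches_unique,
            by.x = "HomeTeam", by.y = "club_bet", all.x = TRUE)
nrow(df)  # should now equal nrow(df_2015_2016), i.e. 5062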

Related

In R, how do I keep the first single occurrence of a row based on a repeated value in one column?

I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My dataframe is an xts object.
In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because the value changes each time, and remove the next four because they didn't change, and so on. The final data frame would look like the smaller one below the original.
Any help is appreciated!
Original Dataframe
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
Final Dataframe
2007-01-31 2.72 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-11-30 5.75 4.76 4
2008-04-30 6.86 4.91 1
2008-07-31 8.00 5.13 4
Here's another solution using run-length encoding, rle().
lens <- rle(df$V4)$lengths      # length of each run of repeated values
df[cumsum(lens) - lens + 1, ]   # row index of the start of each run
Output:
V1 V2 V3 V4
1 2007-01-31 2.72 4.75 2
4 2007-04-30 2.74 4.75 3
5 2007-05-31 2.46 4.75 2
6 2007-06-30 2.98 4.75 3
11 2007-11-30 5.75 4.76 4
16 2008-04-30 6.86 4.91 1
19 2008-07-31 8.00 5.13 4
You can use data.table::shift to keep the rows where the value differs from the previous one, then rbind the first row back on:
library(data.table)
rbind(setDT(dt)[1], dt[v3 != shift(v3)])
Or an equivalent approach using dplyr
library(dplyr)
bind_rows(dt[1, ], filter(dt, v3 != lag(v3)))
Output:
date v1 v2 v3
<IDat> <num> <num> <int>
1: 2007-01-31 2.72 4.75 2
2: 2007-04-30 2.74 4.75 3
3: 2007-05-31 2.46 4.75 2
4: 2007-06-30 2.98 4.75 3
5: 2007-11-30 5.75 4.76 4
6: 2008-04-30 6.86 4.91 1
7: 2008-07-31 8.00 5.13 4
DATA
x <- "
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
"
df <- read.table(textConnection(x), header = FALSE)
and use these two lines:
df$V5 <- c(1, diff(df$V4))   # difference from the previous value; 1 marks the first row
df[abs(df$V5) > 0, ][1:4]    # keep changed rows and drop the helper column V5
#> V1 V2 V3 V4
#> 1 2007-01-31 2.72 4.75 2
#> 4 2007-04-30 2.74 4.75 3
#> 5 2007-05-31 2.46 4.75 2
#> 6 2007-06-30 2.98 4.75 3
#> 11 2007-11-30 5.75 4.76 4
#> 16 2008-04-30 6.86 4.91 1
#> 19 2008-07-31 8.00 5.13 4
Created on 2022-06-12 by the reprex package (v2.0.1)
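For reference, the same change-detection idea condenses to one base R subscripting step (a sketch along the same lines as the diff() answer above, with the helper column skipped):
# TRUE for the first row, then TRUE wherever V4 differs from the row above
df[c(TRUE, diff(df$V4) != 0), ]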

Function for finding Temperature at Dissolved Oxygen of 3 (TDO3) value across a whole year

I am looking to calculate the TDO3 value at every date during the year 2020. I have interpolated data sets of both temperature and dissolved oxygen in 0.25 meter increments from 1m - 22m below the surface between the dates of Jan-1-2020 and Dec-31-2020.
TDO3 is the temperature when dissolved oxygen is 3mg/L. Below are snips of the merged data set.
> print(do_temp, n=85)
# A tibble: 31,110 x 4
date depth mean_temp mean_do
<date> <dbl> <dbl> <dbl>
1 2020-01-01 1 2.12 11.6
2 2020-01-01 1.25 2.19 11.5
3 2020-01-01 1.5 2.27 11.4
4 2020-01-01 1.75 2.34 11.3
5 2020-01-01 2 2.42 11.2
6 2020-01-01 2.25 2.40 11.2
7 2020-01-01 2.5 2.39 11.1
8 2020-01-01 2.75 2.38 11.1
9 2020-01-01 3 2.37 11.0
10 2020-01-01 3.25 2.41 11.0
11 2020-01-01 3.5 2.46 11.0
12 2020-01-01 3.75 2.50 10.9
13 2020-01-01 4 2.55 10.9
14 2020-01-01 4.25 2.54 10.9
15 2020-01-01 4.5 2.53 10.9
16 2020-01-01 4.75 2.52 11.0
17 2020-01-01 5 2.51 11.0
18 2020-01-01 5.25 2.50 11.0
19 2020-01-01 5.5 2.49 11.0
20 2020-01-01 5.75 2.49 11.1
21 2020-01-01 6 2.48 11.1
22 2020-01-01 6.25 2.49 10.9
23 2020-01-01 6.5 2.51 10.8
24 2020-01-01 6.75 2.52 10.7
25 2020-01-01 7 2.54 10.5
26 2020-01-01 7.25 2.55 10.4
27 2020-01-01 7.5 2.57 10.2
28 2020-01-01 7.75 2.58 10.1
29 2020-01-01 8 2.60 9.95
30 2020-01-01 8.25 2.63 10.1
31 2020-01-01 8.5 2.65 10.2
32 2020-01-01 8.75 2.68 10.3
33 2020-01-01 9 2.71 10.5
34 2020-01-01 9.25 2.69 10.6
35 2020-01-01 9.5 2.67 10.7
36 2020-01-01 9.75 2.65 10.9
37 2020-01-01 10 2.63 11.0
38 2020-01-01 10.2 2.65 10.8
39 2020-01-01 10.5 2.67 10.6
40 2020-01-01 10.8 2.69 10.3
41 2020-01-01 11 2.72 10.1
42 2020-01-01 11.2 2.75 9.89
43 2020-01-01 11.5 2.78 9.67
44 2020-01-01 11.8 2.81 9.44
45 2020-01-01 12 2.84 9.22
46 2020-01-01 12.2 2.83 9.39
47 2020-01-01 12.5 2.81 9.56
48 2020-01-01 12.8 2.80 9.74
49 2020-01-01 13 2.79 9.91
50 2020-01-01 13.2 2.80 10.1
51 2020-01-01 13.5 2.81 10.3
52 2020-01-01 13.8 2.82 10.4
53 2020-01-01 14 2.83 10.6
54 2020-01-01 14.2 2.86 10.5
55 2020-01-01 14.5 2.88 10.4
56 2020-01-01 14.8 2.91 10.2
57 2020-01-01 15 2.94 10.1
58 2020-01-01 15.2 2.95 10.0
59 2020-01-01 15.5 2.96 9.88
60 2020-01-01 15.8 2.97 9.76
61 2020-01-01 16 2.98 9.65
62 2020-01-01 16.2 2.99 9.53
63 2020-01-01 16.5 3.00 9.41
64 2020-01-01 16.8 3.01 9.30
65 2020-01-01 17 3.03 9.18
66 2020-01-01 17.2 3.05 9.06
67 2020-01-01 17.5 3.07 8.95
68 2020-01-01 17.8 3.09 8.83
69 2020-01-01 18 3.11 8.71
70 2020-01-01 18.2 3.13 8.47
71 2020-01-01 18.5 3.14 8.23
72 2020-01-01 18.8 3.16 7.98
73 2020-01-01 19 3.18 7.74
74 2020-01-01 19.2 3.18 7.50
75 2020-01-01 19.5 3.18 7.25
76 2020-01-01 19.8 3.18 7.01
77 2020-01-01 20 3.18 6.77
78 2020-01-01 20.2 3.18 5.94
79 2020-01-01 20.5 3.18 5.10
80 2020-01-01 20.8 3.18 4.27
81 2020-01-01 21 3.18 3.43
82 2020-01-01 21.2 3.22 2.60
83 2020-01-01 21.5 3.25 1.77
84 2020-01-01 21.8 3.29 0.934
85 2020-01-01 22 3.32 0.100
# ... with 31,025 more rows
https://github.com/TRobin82/WaterQuality
The above link will get you to the raw data.
What I am looking for is a data frame that looks like this, but with 366 rows, one for each date during the year.
> TDO3
dates tdo3
1 2020-1-1 3.183500
2 2020-2-1 3.341188
3 2020-3-1 3.338625
4 2020-4-1 3.437000
5 2020-5-1 4.453310
6 2020-6-1 5.887560
7 2020-7-1 6.673700
8 2020-8-1 7.825672
9 2020-9-1 8.861190
10 2020-10-1 11.007972
11 2020-11-1 7.136880
12 2020-12-1 2.752500
However, a DO value of exactly 3 mg/L is not found in the interpolated DO data frame, so I would need the function to find the closest value to 3 without going below it, then match the depth of that value against the temperature data frame to assign the proper temperature at that depth.
I am assuming the best route to take is a for-loop, but I'm not sold on the proper way to go about this question.
Here's one way of doing it with tidyverse-style functions. Note that this code is reproducible because anyone can run it and should get the same answer. It's great that you showed us your data, but it's even better to post the output of dput(), because then people can load the data and start helping you immediately.
This code does the following:
Loads the data from the link you provided (since there were several data files, I had to guess which one you meant).
Groups the observations by date.
Puts the observations in increasing order of mean_do.
Removes rows with values of mean_do that are strictly less than 3.
Takes the first ordered observation for each date (this will be the one with the lowest value of mean_do that is greater than or equal to 3).
Renames the column mean_temp as tdo3, since it's the temperature for that date when the dissolved oxygen level was closest to 3 mg/L.
library(tidyverse)
do_temp <- read_csv("https://raw.githubusercontent.com/TRobin82/WaterQuality/main/DateDepthTempDo.csv") %>%
  select(-X1)
do_temp %>%
  group_by(date) %>%
  arrange(mean_do) %>%
  filter(mean_do >= 3) %>%   # drop values strictly below 3 mg/L
  slice_head(n = 1) %>%
  rename(tdo3 = mean_temp) %>%
  select(date, tdo3)
Here are the results. They're a bit different from the ones you posted, so I'm not sure if I've misunderstood you or if those were just illustrative and not real results.
# A tibble: 366 x 2
# Groups: date [366]
date tdo3
<date> <dbl>
1 2020-01-01 3.18
2 2020-01-02 3.18
3 2020-01-03 3.19
4 2020-01-04 3.21
5 2020-01-05 3.21
6 2020-01-06 3.21
7 2020-01-07 3.24
8 2020-01-08 3.28
9 2020-01-09 3.27
10 2020-01-10 3.28
# ... with 356 more rows
Let me know if you were looking for something else.
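As a side note on the code above (assuming dplyr 1.0+ is installed): the arrange()/slice_head() pair can be collapsed into slice_min(), which picks the row with the smallest mean_do per date directly:
do_temp %>%
  group_by(date) %>%
  filter(mean_do >= 3) %>%                          # drop values below 3 mg/L
  slice_min(mean_do, n = 1, with_ties = FALSE) %>%  # lowest remaining DO per date
  rename(tdo3 = mean_temp) %>%
  select(date, tdo3)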

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is: select columns that have no NAs. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[, lapply(.SD, function(x) not(sum(is.na(x) > 0)))] %>%
  as.logical(.)
example[, ..example_select]
Is there better (fewer lines) code to do the same? Thank you!
Try:
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
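If your data.table is recent enough (function-valued .SDcols arrived around version 1.12.0, to the best of my knowledge), the same idea can also be written without precomputing the logical vector:
example[, .SD, .SDcols = function(x) !anyNA(x)]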
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column)))
    return(NULL)
  else
    return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
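A shorter tidyverse variant, assuming dplyr 1.0+ is available: select() with the where() predicate keeps exactly the columns that contain no NA, without building the result column by column:
data %>% select(where(~ !any(is.na(.x))))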
Using Filter:
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15

creating an array of grouped values (means)

I have a large dataset ("bsa", drawn from a 23-year period) which includes a variable ("leftrigh") for "left-right" views (political orientation). I'd like to summarise how the cohorts change over time. For example, in 1994 the average value of this scale for people aged 45 was (say) 2.6; in 1995 the average value of this scale for people aged 46 was (say) 2.7 -- etc etc. I've created a year-of-birth variable ("yrbrn") to facilitate this.
I've successfully created the means:
bsa <- bsa %>% group_by(yrbrn, syear) %>% mutate(meanlr = mean(leftrigh))
Where I'm struggling is to summarise the means by year (of the survey) and age (at the time of the survey). If I could create an array (containing these means) organised by age x survey-year, I could see the change over time by inspecting the diagonals. But I have no clue how to do this -- my skills are very limited...
A tibble: 66,744 x 10
Groups: yrbrn [104]
Rsex Rage leftrigh OldWt syear yrbrn coh per agecat meanlr
1 1 [Male] 40 1 [left] 1.12 2017 1977 17 2017 [37,47) 2.61
2 2 [Female] 79 1.8 0.562 2017 1938 9 2017 [77,87) 2.50
3 2 [Female] 50 1.5 1.69 2017 1967 15 2017 [47,57) 2.59
4 1 [Male] 73 2 0.562 2017 1944 10 2017 [67,77) 2.57
5 2 [Female] 31 3 0.562 2017 1986 19 2017 [27,37) 2.56
6 1 [Male] 74 2.2 0.562 2017 1943 10 2017 [67,77) 2.50
7 2 [Female] 58 2 0.562 2017 1959 13 2017 [57,67) 2.56
8 1 [Male] 59 1.2 0.562 2017 1958 13 2017 [57,67) 2.53
9 2 [Female] 19 4 1.69 2017 1998 21 2017 [17,27) 2.46
Possible format for presenting this information to see change over time:
1994 1995 1996 1997 1998 1999 2000
18
19
20
21
22
23
24
25
etc.
You can group_by both age and year at the same time:
# Setup (make reproducible data)
library(dplyr)
library(tidyr)
n <- 10000
df1 <- data.frame(
  yrbrn    = sample(1920:1995, size = n, replace = TRUE),
  Syear    = sample(2005:2015, size = n, replace = TRUE),
  leftrigh = sample(seq(0, 5, 0.1), size = n, replace = TRUE))
# Solution
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  spread(Syear, meanLR)
Produces the following:
# A tibble: 76 x 12
# Groups: yrbrn [76]
yrbrn `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1920 3.41 1.68 2.26 2.66 3.21 2.59 2.24 2.39 2.41 2.55 3.28
2 1921 2.43 2.71 2.74 2.32 2.24 1.89 2.85 3.27 2.53 1.82 2.65
3 1922 2.28 3.02 1.39 2.33 3.25 2.09 2.35 1.83 2.09 2.57 1.95
4 1923 3.53 3.72 2.87 2.05 2.94 1.99 2.8 2.88 2.62 3.14 2.28
5 1924 1.77 2.17 2.71 2.18 2.71 2.34 2.29 1.94 2.7 2.1 1.87
6 1925 1.83 3.01 2.48 2.54 2.74 2.11 2.35 2.65 2.57 1.82 2.39
7 1926 2.43 3.2 2.53 2.64 2.12 2.71 1.49 2.28 2.4 2.73 2.18
8 1927 1.33 2.83 2.26 2.82 2.34 2.09 2.3 2.66 3.09 2.2 2.27
9 1928 2.34 2.02 2.1 2.88 2.14 2.44 2.58 1.67 2.57 3.11 2.93
10 1929 2.31 2.29 2.93 2.08 2.11 2.47 2.39 1.76 3.09 3 2.9
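A small aside: spread() still works but has been superseded; with tidyr 1.0+ the reshaping step would usually be written with pivot_wider() instead:
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  pivot_wider(names_from = Syear, values_from = meanLR)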

R: returning row value when certain number of columns reach certain value

Starting from the following table, I want to return, for each row, the share of columns that reach a certain value.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As the output, I need a data frame containing, for each row, the percentage of columns V1-V10 that reach the value of interest (5 in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:dim(mat)[1], rowSums(mat > 5) / dim(mat)[2] * 100)
As long as it's always about 0s and 1s with ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
  sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
Ok, so now I believe you want to get the percentage of values in each row that meet some threshold criteria. You give the example > 5. One solution of many is using apply:
apply(df, 1, function(x) sum(x > 5) / length(x) * 100)
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
#Thomas' solution will be faster for large data.frames because it converts to a matrix first, and these are faster to operate on.
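For completeness, rowMeans() on the logical matrix gives the same percentages in one vectorised call (a sketch equivalent to the apply() line above):
rowMeans(df > 5) * 100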
