Merging 2 dataframes (when columns are different) - r

I am trying to merge 2 data frames.
The main dataset, df1, contains numerical data in wide format - each row represents a date, each column contains the value for that date in a given city.
df2 contains metadata for each city: latitude, longitude, and elevation.
What I wish to do is add the metadata for each city to df1, but I was unsuccessful in doing so as the data frames don't match up in structure.
df1
Date Machrihanish High_Wycombe Camborne Dun_Fell Plymouth
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20200101 8.5 6.9 9.6 3.3 9.9
2 20200102 11.7 9.1 11.2 5 10.9
3 20200103 9.1 9.9 11.2 5.1 11.1
4 20200104 9.2 8.1 9.4 2.2 9.4
5 20200105 11.7 7.6 9 4.3 9.3
6 20200106 10.8 8 11.6 3.7 10.6
7 20200107 14.7 11.7 12 6.7 11.5
8 20200108 11.2 11.8 11.6 6.2 11.3
9 20200109 7 12 11.6 -0.2 11.5
10 20200110 9.3 7.4 10 0 10.1
df2
Location Longitude Latitude Elevation
<chr> <dbl> <dbl> <dbl>
1 Machrihanish -5.70 55.4 10
2 High_Wycombe -0.807 51.7 204
3 Camborne -5.33 50.2 87
4 Dun_Fell -2.45 54.7 847
5 Plymouth -4.12 50.4 50

Here is a solution that tidies the data to long format by location and day, and merges the lat / long information.
Using data provided in the original post, we read it into two data frames.
tempText <- "rowId Date Machrihanish High_Wycombe Camborne Dun_Fell Plymouth
1 20200101 8.5 6.9 9.6 3.3 9.9
2 20200102 11.7 9.1 11.2 5 10.9
3 20200103 9.1 9.9 11.2 5.1 11.1
4 20200104 9.2 8.1 9.4 2.2 9.4
5 20200105 11.7 7.6 9 4.3 9.3
6 20200106 10.8 8 11.6 3.7 10.6
7 20200107 14.7 11.7 12 6.7 11.5
8 20200108 11.2 11.8 11.6 6.2 11.3
9 20200109 7 12 11.6 -0.2 11.5
10 20200110 9.3 7.4 10 0 10.1"
library(tidyr)
library(dplyr)
temps <- read.table(text = tempText,header = TRUE)
latLongs <-"rowId Location Longitude Latitude Elevation
1 Machrihanish -5.70 55.4 10
2 High_Wycombe -0.807 51.7 204
3 Camborne -5.33 50.2 87
4 Dun_Fell -2.45 54.7 847
5 Plymouth -4.12 50.4 50"
latLongs <- read.table(text = latLongs,header = TRUE)
Next, we use tidyr::pivot_longer() to generate long format data, and then merge it with the lat long data via dplyr::full_join(). Note that we set the name of the column where the wide format column names are stored with names_to = "Location" so that full_join() uses Location to join the two data frames.
joinedData <- temps %>%
  select(-rowId) %>%
  pivot_longer(Machrihanish:Plymouth, names_to = "Location", values_to = "MaxTemp") %>%
  full_join(latLongs, by = "Location") %>%  # name the join key explicitly
  select(-rowId)
head(joinedData)
...and the first few rows of the joined output look like this:
> head(joinedData)
# A tibble: 6 × 6
Date Location MaxTemp Longitude Latitude Elevation
<int> <chr> <dbl> <dbl> <dbl> <int>
1 20200101 Machrihanish 8.5 -5.7 55.4 10
2 20200101 High_Wycombe 6.9 -0.807 51.7 204
3 20200101 Camborne 9.6 -5.33 50.2 87
4 20200101 Dun_Fell 3.3 -2.45 54.7 847
5 20200101 Plymouth 9.9 -4.12 50.4 50
6 20200102 Machrihanish 11.7 -5.7 55.4 10
>
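A join on city names only works where the names match exactly, so a city spelled differently in the two data frames ends up with NA metadata rather than raising an error. A minimal sketch of how one might check for that, assuming dplyr and tidyr are installed; the small data frames and the deliberately mismatched name "Dun Fell" are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature versions of the two tables from the question
temps <- data.frame(
  Date = c(20200101, 20200102),
  Machrihanish = c(8.5, 11.7),
  Dun_Fell = c(3.3, 5.0)
)
meta <- data.frame(
  Location = c("Machrihanish", "Dun Fell"),  # note the space: will not match
  Elevation = c(10, 847)
)

long <- pivot_longer(temps, -Date, names_to = "Location", values_to = "MaxTemp")

# Locations in the temperature data that have no metadata row
unmatched <- anti_join(long, meta, by = "Location") %>% distinct(Location)

# left_join() keeps every temperature row and fills NA where metadata is missing
joined <- left_join(long, meta, by = "Location")
```

If `unmatched` has any rows, the names need cleaning before the join; otherwise the joined table is complete.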

Related

Copy the value of a column from the previous row where conditions on other columns are met

I have a dataframe (df) with 3 columns (V1, V2 and V3). I would like to add a column (V4), in which I enter (for each row) the value of V3 from another row in which the value of V1+0.5 equals V1 of row i AND in which the value of V2+0.5 equals V2 of row i.
For the rows where this condition is not met, I want an NA in the column of V4.
V1 <- c(-2.5,-2,-1.5,-1,-0.5,0,0.5,1,1.5,2,2.5,3,3.5)
V2 <- c(14,14.5,15,15.5,1,1.5,2,2.5,8,8.5,9,9.5,10)
V3 <- c(42,42.1,42.2,42.3,42.4,42.5,42.6,42.7,42.8,42.9,43,43.1,43.2)
df <- data.frame(V1,V2,V3)
For this input data:
V1 V2 V3
-2.5 14 42
-2 14.5 42.1
-1.5 15 42.2
-1 15.5 42.3
-0.5 1 42.4
0 1.5 42.5
0.5 2 42.6
1 2.5 42.7
1.5 8 42.8
2 8.5 42.9
2.5 9 43
3 9.5 43.1
3.5 10 43.2
My desired result would be:
V1 V2 V3 V4
-2.5 14 42 NA
-2 14.5 42.1 42
-1.5 15 42.2 42.1
-1 15.5 42.3 42.2
-0.5 1 42.4 NA
0 1.5 42.5 42.4
0.5 2 42.6 42.5
1 2.5 42.7 42.6
1.5 8 42.8 NA
2 8.5 42.9 42.8
2.5 9 43 42.9
3 9.5 43.1 43
3.5 10 43.2 43.1
I figured I could use a for-loop and an ifelse statement (using is.na for the NA values), but I do not know how to refer to the rows using something that looks like df$V1(of row i) == df$V1+0.5 (of row x) (and the same for V2).
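Another dplyr solution uses ifelse() together with lag(), which assumes the rows are already in order so that the matching row is the one directly above:
library(dplyr)
df1 %>%
  mutate(V4 = ifelse(V1 == lag(V1) + 0.5 & V2 == lag(V2) + 0.5, lag(V3), NA))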
#> V1 V2 V3 V4
#> 1 -2.5 14.0 42.0 NA
#> 2 -2.0 14.5 42.1 42.0
#> 3 -1.5 15.0 42.2 42.1
#> 4 -1.0 15.5 42.3 42.2
#> 5 -0.5 1.0 42.4 NA
#> 6 0.0 1.5 42.5 42.4
#> 7 0.5 2.0 42.6 42.5
#> 8 1.0 2.5 42.7 42.6
#> 9 1.5 8.0 42.8 NA
#> 10 2.0 8.5 42.9 42.8
#> 11 2.5 9.0 43.0 42.9
#> 12 3.0 9.5 43.1 43.0
#> 13 3.5 10.0 43.2 43.1
Data:
df1 <- data.frame(V1 = c(-2.5,-2,-1.5,-1,-0.5,0,0.5,1,1.5,2,2.5,3,3.5),
                  V2 = c(14,14.5,15,15.5,1,1.5,2,2.5,8,8.5,9,9.5,10),
                  V3 = c(42,42.1,42.2,42.3,42.4,42.5,42.6,42.7,42.8,42.9,43,43.1,43.2))
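lag() compares each row with the one directly above it, so it relies on the rows being sorted. If that cannot be assumed, one possible order-independent sketch in base R looks up the matching row by key instead. Building text keys with paste() is an assumption that works here because every value is an exact multiple of 0.5; arbitrary decimals would need rounding first.

```r
df <- data.frame(
  V1 = c(-2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5),
  V2 = c(14, 14.5, 15, 15.5, 1, 1.5, 2, 2.5, 8, 8.5, 9, 9.5, 10),
  V3 = c(42, 42.1, 42.2, 42.3, 42.4, 42.5, 42.6, 42.7, 42.8, 42.9, 43, 43.1, 43.2)
)

# Key identifying each row, and the key of the row to copy V3 from
key    <- paste(df$V1, df$V2)
wanted <- paste(df$V1 - 0.5, df$V2 - 0.5)

# match() returns NA where no such row exists, so V4 is NA there
df$V4 <- df$V3[match(wanted, key)]
df
```

Because the lookup is by value rather than by position, shuffling the rows of df first would give the same V4 for each row.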

How do I use a conditional (if or while) to replace values in R?

I am working with data on electricity prices. The price is recorded every 5 minutes. My objective is to replace the negative values with the mean of the day.
year month day fivemin rrp_nsw rrp_qld rrp_sa rrp_tas rrp_vic
2009 7 1 1 16.9 17.6 16.7 15.7 15.5
2009 7 1 2 17.7 18.8 17.8 -16.1 15.5
2009 7 1 3 -17.7 18.6 18.1 15.9 15.4
2009 7 1 4 16.7 18.6 -17.6 14.3 12.8
2009 7 2 1 -15.6 17.6 16.3 13.2 11.8
2009 7 2 2 13.7 15.7 12.0 -11.1 -12.9
2009 7 2 3 13.7 15.8 11.9 11.1 12.9
2009 7 2 4 -13.9 16.1 -12.1 11.2 12.9
2009 8 1 1 13.8 16.0 12.2 11.2 12.8
2009 8 1 2 -13.7 16.3 11.6 10.6 12.6
2009 8 1 3 13.7 -15.8 11.9 11.0 12.7
2009 8 1 4 13.8 16.0 12.1 11.2 12.9
2009 8 2 1 17.6 -17.6 17.3 16.5 17.1
2009 8 2 2 17.7 17.6 17.3 16.8 17.4
2009 8 2 3 15.8 16.0 15.1 15.0 15.5
2009 8 2 4 -15.4 15.6 14.5 14.6 15.1
2009 9 1 1 14.7 15.0 13.8 14.0 14.5
2009 9 1 2 15.3 15.4 14.3 14.6 15.0
2009 9 1 3 15.3 15.6 14.4 14.5 15.0
2009 9 1 4 14.9 15.7 13.7 13.8 14.5
To obtain the mean of each day, I use the following code:
library(dplyr)
Daily_mean <- Base %>%
  arrange(year, month, day, fivemin) %>%  # order the data
  group_by(year, month, day) %>%
  summarise_at(
    vars(rrp_nsw, rrp_qld, rrp_sa, rrp_tas, rrp_vic),
    .funs = mean)  # funs() is deprecated; pass the function directly
Once I have the daily means, I want to replace each negative value with the mean for that day. For example, the 16th observation (where rrp_nsw is -15.4) would become:
2009 8 2 4 "8.925" 15.6 14.5 14.6 15.1
If someone could help me, I would be grateful.
We can use replace() inside mutate_at() to change the negative values to the mean of that column, after grouping by the relevant columns:
library(dplyr)
Base %>%
  arrange(year, month, day, fivemin) %>%
  group_by(year, month, day) %>%
  mutate_at(vars(rrp_nsw, rrp_qld, rrp_sa, rrp_tas, rrp_vic),
            ~ replace(., . < 0, mean(.)))
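mutate_at() still works but is superseded by across() in dplyr 1.0.0 and later. A sketch of the same idea with across(), assuming that version or newer and using a made-up two-day table (the data frame name `prices` and its single `day` key are hypothetical). As in the answer above, the daily mean includes the negative values themselves, which is what reproduces the 8.925 from the question.

```r
library(dplyr)

# Hypothetical subset: day 1 reuses the four rrp_nsw / rrp_qld values
# from 2009-08-02 in the question; day 2 is invented
prices <- data.frame(
  day     = c(1, 1, 1, 1, 2, 2),
  rrp_nsw = c(17.6, 17.7, 15.8, -15.4, 10, -2),
  rrp_qld = c(-17.6, 17.6, 16.0, 15.6, 5, 7)
)

fixed <- prices %>%
  group_by(day) %>%
  # within each day, overwrite negatives with that day's column mean
  mutate(across(starts_with("rrp_"), ~ replace(.x, .x < 0, mean(.x)))) %>%
  ungroup()

fixed
```

If the negative readings should be excluded from the average, `mean(.x[.x >= 0])` could be used in place of `mean(.x)`; which one is wanted depends on the analysis.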

How to convert a character array to data frame

I have a character array dat which I want to convert to a data frame df but it is not working
head(dat)
[1] " 1931 1 5.0 0.6 11 78.4 43.4"
[2] " 1931 2 6.7 0.7 7 48.9 63.6"
[3] " 1931 4 10.4 3.1 3 44.6 110.1"
[4] " 1931 5 13.2 6.1 1 63.7 167.4"
[5] " 1931 6 15.4 8.0 0 87.8 150.3"
[6] " 1931 7 17.3 10.6 0 121.4 111.2"
> df<-as.data.frame(dat)
> head(df)
dat
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2
df[,c(3)]
Error in [.data.frame(df, , c(3)) : undefined columns selected
Read the text with read.table; you can rename the columns afterwards as desired.
df<-read.table(text = " dat
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2",
header = FALSE, fill = TRUE, as.is = TRUE, skip = 1)
df[3]
V3
1 1
2 2
3 4
4 5
5 6
6 7
If dat is as shown reproducibly in the Note at the end, then as.data.frame(dat) creates a data frame with a single column called dat. Asking for the 3rd column therefore fails, because there is only one column.
Instead, use read.table and get the third column like this. Omit the comma if you want a data frame result.
read.table(text = dat)[, 3]
## [1] 5.0 6.7 10.4 13.2 15.4 17.3
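read.table() names the columns V1, V2, ... by default. A small sketch of naming them at read time through its col.names argument; the names used here are placeholders, since the question does not say what each column holds:

```r
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
         " 1931 2 6.7 0.7 7 48.9 63.6",
         " 1931 4 10.4 3.1 3 44.6 110.1")

# Placeholder column names (assumed, not taken from the question)
cols <- c("year", "v1", "v2", "v3", "v4", "v5", "v6")

df <- read.table(text = dat, col.names = cols)
str(df)   # the columns are parsed as numbers, not text
df$v2     # select a column by name instead of by position
```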
Note
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
" 1931 2 6.7 0.7 7 48.9 63.6",
" 1931 4 10.4 3.1 3 44.6 110.1",
" 1931 5 13.2 6.1 1 63.7 167.4",
" 1931 6 15.4 8.0 0 87.8 150.3",
" 1931 7 17.3 10.6 0 121.4 111.2")
Here's a tidyverse approach:
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
" 1931 2 6.7 0.7 7 48.9 63.6",
" 1931 4 10.4 3.1 3 44.6 110.1",
" 1931 5 13.2 6.1 1 63.7 167.4",
" 1931 6 15.4 8.0 0 87.8 150.3",
" 1931 7 17.3 10.6 0 121.4 111.2")
library(tidyverse)
str_trim(dat) %>%                 # trim leading space
  tibble(x = .) %>%               # put into tibble (data.frame)
  separate(x,                     # separate x into 7 columns, named below
           into = c("year","v1","v2","v3","v4","v5","v6"),
           sep = "[ ]{1,}")       # separate by one or more spaces ("[ ]{1,}")
That leads to:
# A tibble: 6 x 7
year v1 v2 v3 v4 v5 v6
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2
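In the output above every column is character (<chr>). One way to get numeric columns, sketched here under the assumption that tidyr is installed, is the convert = TRUE argument of separate(). Base trimws() and data.frame() stand in for str_trim() and tibble() to keep the example small.

```r
library(tidyr)

dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
         " 1931 2 6.7 0.7 7 48.9 63.6",
         " 1931 4 10.4 3.1 3 44.6 110.1")

out <- separate(
  data.frame(x = trimws(dat), stringsAsFactors = FALSE),
  x,
  into = c("year", "v1", "v2", "v3", "v4", "v5", "v6"),
  sep = " +",       # one or more spaces
  convert = TRUE    # run type conversion on the new columns
)
str(out)
```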

Calculate medians of rows in a grouped dataframe

I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The weeks have different numbers of entries, ranging from one entry up to four entries per week.
I want to calculate the median for each week, for all the different variables (t_10 through t_30), and output the result in a new dataframe. NA cells are already omitted in the original dataframe. I have tried different approaches with the ddply function of the plyr package, but to no avail so far.
We could use summarise_at for multiple columns
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
  group_by(Week) %>%
  summarise_at(vars(colsToKeep), median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
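summarise_at() is superseded in dplyr 1.0.0 and later. A sketch of the equivalent with across(), assuming that version or newer; all_of() makes explicit that colsToKeep is a character vector of column names. Only two of the value columns from the question are reproduced here.

```r
library(dplyr)

df1 <- data.frame(
  Week = c(1, 2, 2, 3, 4, 4, 4, 4),
  t_10 = c(51.4, 51.9, 52.4, 52.2, 52.2, 52.7, 51.1, 51.9),
  t_30 = c(5.6, 6.2, 6.1, 5.9, 5.9, 5.9, 6.0, 5.8)
)

colsToKeep <- c("t_10", "t_30")

weekly <- df1 %>%
  group_by(Week) %>%
  summarise(across(all_of(colsToKeep), median))

weekly
```

To summarise every t_ column without listing them, `starts_with("t_")` could replace `all_of(colsToKeep)`.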
Specify variables to keep in colsToKeep and store input table in d
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
  filter(variable %in% colsToKeep) %>%
  group_by(Week, variable) %>%
  summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
Week variable median
<int> <chr> <dbl>
1 1 t_10 51.40
2 1 t_30 5.60
3 2 t_10 52.15
4 2 t_30 6.15
5 3 t_10 52.20
6 3 t_30 5.90
7 4 t_10 52.05
8 4 t_30 5.90
You can also use the aggregate function:
newdf <- aggregate(. ~ Week, data = df, FUN = median)
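In the formula interface of aggregate(), the variables to summarise go on the left of ~ and the grouping variables on the right, so `. ~ Week` reads as "every other column, grouped by Week". A base R sketch on a subset of the question's columns (the name `d` is arbitrary), with the non-formula form alongside for comparison:

```r
d <- data.frame(
  Week = c(1, 2, 2, 3, 4, 4, 4, 4),
  t_10 = c(51.4, 51.9, 52.4, 52.2, 52.2, 52.7, 51.1, 51.9),
  t_30 = c(5.6, 6.2, 6.1, 5.9, 5.9, 5.9, 6.0, 5.8)
)

# Formula interface: summarise all other columns by Week
weekly <- aggregate(. ~ Week, data = d, FUN = median)

# Non-formula interface: columns to summarise, then a named list of groups
weekly2 <- aggregate(d[c("t_10", "t_30")], by = list(Week = d$Week), FUN = median)

weekly
```

One difference worth knowing: the formula interface drops rows containing NA by default (na.action = na.omit), which does not matter here because the question states NAs were already removed.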

Draw histograms per row over multiple columns in R

I'm using R for the analysis in my master's thesis.
I have the following data frame, STOF (student-to-staff ratio):
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
Then I rank the colleges using this command:
HEIrank1<-(STOF[,-c(1)])
rank1 <- apply(HEIrank1,2,rank)
> HEIrank11
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 18.0 20 20.0 20.0 20.0 20
2 MO 14.0 9 13.0 13.5 2.0 12
3 SD 15.0 16 17.0 16.0 16.0 19
4 UN 20.0 18 8.0 13.5 14.0 13
5 WS 12.0 8 4.0 7.0 6.0 8
6 BF 6.5 17 9.5 15.0 10.0 14
7 ME 17.0 15 19.0 19.0 17.0 15
8 IM 2.0 6 12.0 8.0 8.5 10
9 OM 4.5 3 2.5 3.0 3.0 11
10 DC 11.0 14 11.0 9.0 15.0 9
11 OC 16.0 19 16.0 18.0 19.0 17
How can I draw a histogram for each HEI (that is, for each row)?
If you use ggplot2, you don't need a loop; you can plot them all at once. You first need to reshape your data from wide format to long format, which the melt function from the reshape2 package can do.
library(reshape2)
new.df<-melt(HEIrank11,id.vars="HEI.ID")
names(new.df)=c("HEI.ID","Year","Rank")
substring is just getting rid of the X in each year
library(ggplot2)
ggplot(new.df, aes(x = HEI.ID, y = Rank, fill = substring(Year, 2))) +
  geom_col(position = "dodge")  # bars of precomputed values; same as geom_bar(stat = "identity")
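To get a separate panel for each institution rather than one dodged chart, facet_wrap() is one option. A minimal sketch, assuming ggplot2 and tidyr are installed and using three rows and three years of the question's STOF table. pivot_longer() is swapped in for melt() here, and strictly speaking these are bar charts of one value per year, not histograms.

```r
library(ggplot2)
library(tidyr)

stof <- data.frame(
  HEI.ID = c("OP", "MO", "SD"),
  X2007 = c(41.8, 20.0, 21.2),
  X2008 = c(147.6, 20.8, 32.3),
  X2009 = c(90.3, 21.1, 25.7)
)

# One row per institution and year
long <- pivot_longer(stof, -HEI.ID, names_to = "Year", values_to = "Value")
long$Year <- sub("^X", "", long$Year)   # "X2007" -> "2007"

p <- ggplot(long, aes(x = Year, y = Value)) +
  geom_col() +
  facet_wrap(~ HEI.ID)   # one panel per institution

p   # printing the object draws the plot
```

Adding `scales = "free_y"` to facet_wrap() may help when one institution's values dwarf the others, as OP's do here.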
Here's a solution in lattice:
require(lattice)
barchart(X2007+X2008+X2009+X2010+X2011+X2012 ~ HEI.ID,
data=HEIrank11,
auto.key=list(space='right')
)
