I would like to know how to transform rows to columns for the following dataset.
School class Avg Subavg Sub
ABC 2 25.3 17.2 Geo
ABC 2 25.3 18.2 Mat
ABC 2 25.3 20.2 Fre
ABC 3 21.2 17.2 Geo
ABC 3 21.2 18.2 Mat
ABC 3 21.2 20.2 Ger
ABC 4 16.8 17.2 Ger
ABC 4 16.8 18.2 Mat
ABC 5 20.2 20.2 Fre
Expected output would be
School Std stdavg Geo Mat Ger Fre
ABC 2 25.3 17.2 18.2 NA 20.2
ABC 3 21.2 17.2 18.2 20.2 NA
ABC 4 16.8 NA 18.2 17.2 NA
ABC 5 20.2 NA NA NA 20.2
I tried the split function, but in vain.
Thanks in advance
We can use dcast from data.table:
library(data.table)
dcast(setDT(df1), School+class+Avg~Sub, value.var="Subavg")
# School class Avg Fre Geo Ger Mat
#1: ABC 2 25.3 20.2 17.2 NA 18.2
#2: ABC 3 21.2 NA 17.2 20.2 18.2
#3: ABC 4 16.8 NA NA 17.2 18.2
#4: ABC 5 20.2 20.2 NA NA NA
Or use spread from tidyr
library(tidyr)
spread(df1, Sub, Subavg)
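In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape follows, assuming df1 holds the data exactly as posted in the question (it is rebuilt by hand here; the name wide is only illustrative):

```r
library(tidyr)

# Rebuild the question's data (values copied from the post)
df1 <- data.frame(
  School = "ABC",
  class  = c(2, 2, 2, 3, 3, 3, 4, 4, 5),
  Avg    = c(25.3, 25.3, 25.3, 21.2, 21.2, 21.2, 16.8, 16.8, 20.2),
  Subavg = c(17.2, 18.2, 20.2, 17.2, 18.2, 20.2, 17.2, 18.2, 20.2),
  Sub    = c("Geo", "Mat", "Fre", "Geo", "Mat", "Ger", "Ger", "Mat", "Fre")
)

# One row per School/class/Avg combination, one column per subject;
# subjects missing for a class are filled with NA
wide <- pivot_wider(df1, names_from = Sub, values_from = Subavg)
wide
```

The columns that are not named in names_from or values_from (School, class, Avg) act as the row identifiers, which plays the same role as the left-hand side of the dcast formula.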
I am trying to merge 2 data frames.
The main dataset, df1, contains numerical data in wide format - each row represents a date, each column contains the value for that date in a given city.
df2 contains metadata for each city: latitude, longitude, and elevation.
What I wish to do is add the metadata for each city to df1, but I was unsuccessful in doing so as the data frames don't match up in structure.
df1
Date Machrihanish High_Wycombe Camborne Dun_Fell Plymouth
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20200101 8.5 6.9 9.6 3.3 9.9
2 20200102 11.7 9.1 11.2 5 10.9
3 20200103 9.1 9.9 11.2 5.1 11.1
4 20200104 9.2 8.1 9.4 2.2 9.4
5 20200105 11.7 7.6 9 4.3 9.3
6 20200106 10.8 8 11.6 3.7 10.6
7 20200107 14.7 11.7 12 6.7 11.5
8 20200108 11.2 11.8 11.6 6.2 11.3
9 20200109 7 12 11.6 -0.2 11.5
10 20200110 9.3 7.4 10 0 10.1
df2
Location Longitude Latitude Elevation
<chr> <dbl> <dbl> <dbl>
1 Machrihanish -5.70 55.4 10
2 High_Wycombe -0.807 51.7 204
3 Camborne -5.33 50.2 87
4 Dun_Fell -2.45 54.7 847
5 Plymouth -4.12 50.4 50
Here is a solution that tidies the data to long format by location and day, and merges the lat / long information.
Using data provided in the original post, we read it into two data frames.
tempText <- "rowId Date Machrihanish High_Wycombe Camborne Dun_Fell Plymouth
1 20200101 8.5 6.9 9.6 3.3 9.9
2 20200102 11.7 9.1 11.2 5 10.9
3 20200103 9.1 9.9 11.2 5.1 11.1
4 20200104 9.2 8.1 9.4 2.2 9.4
5 20200105 11.7 7.6 9 4.3 9.3
6 20200106 10.8 8 11.6 3.7 10.6
7 20200107 14.7 11.7 12 6.7 11.5
8 20200108 11.2 11.8 11.6 6.2 11.3
9 20200109 7 12 11.6 -0.2 11.5
10 20200110 9.3 7.4 10 0 10.1"
library(tidyr)
library(dplyr)
temps <- read.table(text = tempText,header = TRUE)
latLongs <-"rowId Location Longitude Latitude Elevation
1 Machrihanish -5.70 55.4 10
2 High_Wycombe -0.807 51.7 204
3 Camborne -5.33 50.2 87
4 Dun_Fell -2.45 54.7 847
5 Plymouth -4.12 50.4 50"
latLongs <- read.table(text = latLongs,header = TRUE)
Next, we use tidyr::pivot_longer() to generate long format data, and then merge it with the lat long data via dplyr::full_join(). Note that we set the name of the column where the wide format column names are stored with names_to = "Location" so that full_join() uses Location to join the two data frames.
joinedData <- temps %>%
  select(-rowId) %>%
  pivot_longer(Machrihanish:Plymouth, names_to = "Location", values_to = "MaxTemp") %>%
  full_join(latLongs, by = "Location") %>%
  select(-rowId)
head(joinedData)
...and the first few rows of the joined output look like this:
> head(joinedData)
# A tibble: 6 × 6
Date Location MaxTemp Longitude Latitude Elevation
<int> <chr> <dbl> <dbl> <dbl> <int>
1 20200101 Machrihanish 8.5 -5.7 55.4 10
2 20200101 High_Wycombe 6.9 -0.807 51.7 204
3 20200101 Camborne 9.6 -5.33 50.2 87
4 20200101 Dun_Fell 3.3 -2.45 54.7 847
5 20200101 Plymouth 9.9 -4.12 50.4 50
6 20200102 Machrihanish 11.7 -5.7 55.4 10
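Because full_join() keeps unmatched rows from both sides and fills the gaps with NA, it can help to check that the city column names and the Location values agree before joining. A rough base R sketch of such a check, using made-up miniature tables (the extra Camborne row and the names cities, noMetadata, and noTemps are only for illustration):

```r
# Hypothetical miniature versions of the two tables from the question
temps <- data.frame(Date = c(20200101, 20200102),
                    Machrihanish = c(8.5, 11.7),
                    Plymouth = c(9.9, 10.9))
latLongs <- data.frame(Location = c("Machrihanish", "Plymouth", "Camborne"),
                       Elevation = c(10, 50, 87))

# City columns in the wide table (everything except Date)
cities <- setdiff(names(temps), "Date")

# Cities with temperatures but no metadata, and the reverse
noMetadata <- setdiff(cities, latLongs$Location)
noTemps    <- setdiff(latLongs$Location, cities)

noMetadata  # empty: every temperature column has metadata
noTemps     # "Camborne": metadata without a temperature column
```

If either vector is non-empty, the joined result will contain rows with NA in the temperature or metadata columns.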
I have weather data with NAs scattered throughout, and I want to calculate rolling means. I have been using the rollapplyr function from zoo, but even though I include partial = TRUE, it still returns NA whenever, for example, 1 of the 30 values to be averaged is NA.
Here is the code:
weather_rolled <- weather %>%
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, partial = TRUE))
Here's my data:
# A tibble: 7,160 x 11
station_name date max_temp avg_temp min_temp rainfall rh avg_wind_speed dew_point avg_bare_soil_temp total_solar_rad
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VEGREVILLE 2019-01-01 0.9 -7.9 -16.6 1 81.7 20.2 -7.67 NA NA
2 VEGREVILLE 2019-01-02 5.5 1.5 -2.5 0 74.9 13.5 -1.57 NA NA
3 VEGREVILLE 2019-01-03 3.3 -0.9 -5 0.5 80.6 10.1 -3.18 NA NA
4 VEGREVILLE 2019-01-04 -1.1 -4.7 -8.2 5.2 92.1 8.67 -4.76 NA NA
5 VEGREVILLE 2019-01-05 -3.8 -6.5 -9.2 0.2 92.6 14.3 -6.81 NA NA
6 VEGREVILLE 2019-01-06 -3 -4.4 -5.9 0 91.1 16.2 -5.72 NA NA
7 VEGREVILLE 2019-01-07 -5.8 -12.2 -18.5 0 75.5 30.6 -16.9 NA NA
8 VEGREVILLE 2019-01-08 -17.4 -21.6 -25.7 1.2 67.8 16.1 -26.1 NA NA
9 VEGREVILLE 2019-01-09 -12.9 -15.1 -17.4 0.2 71.5 14.3 -17.7 NA NA
10 VEGREVILLE 2019-01-10 -13.2 -17.9 -22.5 0.4 80.2 3.38 -21.8 NA NA
# ... with 7,150 more rows
Essentially, whenever an NA appears midway through, it results in a lot of NAs for the rolling mean. I still want to calculate the rolling mean within that time frame, ignoring the NAs. Does anyone know a way around this? I have been searching online for hours to no avail.
Thanks!
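One possible approach, sketched here on a made-up toy vector rather than the poster's data: partial = TRUE only controls windows that are shorter than the width at the start of the series, while NA handling is up to the function being applied. rollapplyr() forwards extra arguments to that function, so na.rm = TRUE can be passed through to mean():

```r
library(zoo)

# Toy series with a gap (made-up values, not the poster's data)
x <- c(1, 2, NA, 4, 5)

# Arguments that rollapplyr() does not use itself are forwarded to FUN,
# so na.rm = TRUE reaches mean() and NAs inside a window are dropped
rolled <- rollapplyr(x, width = 3, FUN = mean, na.rm = TRUE, partial = TRUE)
rolled
# [1] 1.0 1.5 1.5 3.0 4.5
```

In the pipeline from the question this would presumably become rollapplyr(max_temp, 30, mean, na.rm = TRUE, partial = TRUE). Note that a window consisting entirely of NAs yields NaN rather than NA.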
How do I compute the mean of the last 5 observations by class in a data set? The first column is the class (Plot) and the second column is the measured variable (Weight).
Plot Weight
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
We select the last 5 observations for each 'Plot' and get the mean
library(dplyr)
df1 %>%
group_by(Plot) %>%
summarise(MeanWt = mean(tail(Weight, 5)))
Or with data.table
library(data.table)
setDT(df1)[, .(MeanWt = mean(tail(Weight, 5))), by = Plot]
Or using base R
aggregate(cbind(MeanWt = Weight) ~ Plot, data = df1, FUN = function(x) mean(tail(x, 5)))
I did this without any library.
It's a step-by-step solution; of course, you can make the code shorter using a for loop or an apply function.
I hope you find it useful.
#Collecting your data (scan(text = ...) lets the script run non-interactively)
values <- scan(text = "1 12.5 1 14.5 1 15.8 1 16.1 1 18.9 1 21.2 1 23.4 1 25.7 2 13.1 2 15.0 2 15.8
2 16.3 2 17.4 2 18.6 2 22.6 2 24.1 2 25.6 3 11.5 3 12.2 3 13.9 3 14.7 3 18.9
3 20.5 3 21.6 3 22.6 3 24.1 3 25.8")
data_w <- matrix(values, ncol = 2, byrow = TRUE)
#Naming your cols
colnames(data_w) <- c("Plot", "Weight")
dt_w <- as.data.frame(data_w)
#Mean of the 5 last observations by class:
#Weights that belong to Plot 1
weights1 <- dt_w$Weight[dt_w$Plot == 1]
#Number of observations in Plot 1
size1 <- length(weights1)
#Position of the first of the last 5 values (size1 - 5 would select 6 values)
index1 <- size1 - 4
#Way to compute the mean
mean1 <- mean(weights1[index1:size1])
#mean of the last 5 observations of class 1
mean1
For classes 2 and 3 the process is the same: replace 1 with the class number when filtering the weights.
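The per-class steps can also be collapsed into a single base R call. This is only a sketch, assuming dt_w has the Plot and Weight columns shown in the question (the data frame is rebuilt by hand here, and the name means is illustrative):

```r
# Rebuild the question's data: 8, 9 and 10 observations for Plots 1, 2 and 3
dt_w <- data.frame(
  Plot   = rep(1:3, times = c(8, 9, 10)),
  Weight = c(12.5, 14.5, 15.8, 16.1, 18.9, 21.2, 23.4, 25.7,
             13.1, 15.0, 15.8, 16.3, 17.4, 18.6, 22.6, 24.1, 25.6,
             11.5, 12.2, 13.9, 14.7, 18.9, 20.5, 21.6, 22.6, 24.1, 25.8)
)

# tapply() splits Weight by Plot and applies the function to each piece;
# tail(w, 5) keeps the last 5 values of that piece
means <- tapply(dt_w$Weight, dt_w$Plot, function(w) mean(tail(w, 5)))
means
#     1     2     3
# 21.06 21.66 22.92
```

Using tail() avoids the index arithmetic entirely, and it also copes with a class that has fewer than 5 observations by averaging whatever is there.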
I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The weeks have different numbers of entries, ranging from one entry to several (up to 4) per week.
I want to calculate the median for each week and output it for all the different variables (t_10 through t_30) in a new dataframe. NA cells are already omitted in the original dataframe. I have tried different approaches with the ddply function of the plyr package, but to no avail so far.
We could use summarise_at for multiple columns
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
  group_by(Week) %>%
  summarise_at(vars(colsToKeep), median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
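In dplyr 1.0.0 and later, summarise_at() is superseded by across(). A small sketch of the equivalent call follows, assuming the same two columns as above (the data frame is rebuilt by hand from the question, and weekly is an illustrative name). Selecting with starts_with("t_") would also pick up the remaining t_ columns if they were present:

```r
library(dplyr)

# Two of the question's columns, values copied from the post
df1 <- data.frame(
  Week = c(1, 2, 2, 3, 4, 4, 4, 4),
  t_10 = c(51.4, 51.9, 52.4, 52.2, 52.2, 52.7, 51.1, 51.9),
  t_30 = c(5.6, 6.2, 6.1, 5.9, 5.9, 5.9, 6.0, 5.8)
)

# across() applies median to every selected column within each Week
weekly <- df1 %>%
  group_by(Week) %>%
  summarise(across(starts_with("t_"), median))
weekly
```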
Specify the variables to keep in colsToKeep and store the input table in d:
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
  filter(variable %in% colsToKeep) %>%
  group_by(Week, variable) %>%
  summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
#    Week variable median
#   <int> <chr>     <dbl>
# 1     1 t_10      51.40
# 2     1 t_30       5.60
# 3     2 t_10      52.15
# 4     2 t_30       6.15
# 5     3 t_10      52.20
# 6     3 t_30       5.90
# 7     4 t_10      52.05
# 8     4 t_30       5.90
You can also use the aggregate function, with the measured columns on the left of the formula and the grouping variable on the right:
newdf <- aggregate(. ~ Week, data = df, FUN = median)
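As a check on the base R route, here is a minimal sketch that computes per-week medians with aggregate() on the data from the question, rebuilt by hand. In the formula, the dot stands for every column other than Week, and Week on the right-hand side is the grouping variable:

```r
# The question's data, values copied from the post
df <- data.frame(
  Week = c(1, 2, 2, 3, 4, 4, 4, 4),
  t_10 = c(51.4, 51.9, 52.4, 52.2, 52.2, 52.7, 51.1, 51.9),
  t_15 = c(37.8, 37.8, 38.5, 38.6, 38.3, 38.4, 37.8, 38.0),
  t_18 = c(25.6, 25.8, 26.2, 26.1, 26.1, 25.8, 25.7, 26.0),
  t_20 = c(19.7, 20.4, 20.5, 20.4, 20.2, 20.0, 20.0, 19.8),
  t_25 = c(11.9, 12.3, 12.3, 12.4, 12.1, 12.1, 12.2, 12.0),
  t_30 = c(5.6, 6.2, 6.1, 5.9, 5.9, 5.9, 6.0, 5.8)
)

# Median of every t_ column within each Week: one row per week
newdf <- aggregate(. ~ Week, data = df, FUN = median)
newdf
```

The result has one row per week and keeps all six t_ columns, which matches what the question asks for.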
I have a dataframe to which I am trying to add a column containing the median of the current and the previous 2 values.
Date Value
21/07/2016 14.8
22/07/2016 14.9
23/07/2016 15.8
24/07/2016 15.0
25/07/2016 15.7
26/07/2016 15.6
27/07/2016 16.1
28/07/2016 16.1
I used the following code:
library(zoo)
dataframe$medianval <-rollmedian(dataframe$Value,k=3)
I get the following error
> Error: k <= n is not TRUE
Any suggestions?
Think about what R is trying to do here. The data frame has 8 rows, but the vector you want to append has only 6 elements. To which rows should those elements align? What should R put in the other two spots?
library(zoo)
dataframe <- read.table(text="Date Value
21/07/2016 14.8
22/07/2016 14.9
23/07/2016 15.8
24/07/2016 15.0
25/07/2016 15.7
26/07/2016 15.6
27/07/2016 16.1
28/07/2016 16.1", header=TRUE)
rollmedian(dataframe$Value,k=3)
# [1] 14.9 15.0 15.7 15.6 15.7 16.1
nrow(dataframe) # [1] 8
length(rollmedian(dataframe$Value,k=3)) # [1] 6
Because I can guess what you meant (correct me if I'm wrong), I would try:
dataframe$medianval <- c(NA, NA, rollmedian(dataframe$Value,k=3))
dataframe
# Date Value medianval
# 1 21/07/2016 14.8 NA
# 2 22/07/2016 14.9 NA
# 3 23/07/2016 15.8 14.9
# 4 24/07/2016 15.0 15.0
# 5 25/07/2016 15.7 15.7
# 6 26/07/2016 15.6 15.6
# 7 27/07/2016 16.1 15.7
# 8 28/07/2016 16.1 16.1
If you want to be able to adapt this conveniently, you should write a simple function:
med.fun <- function(var, data, k){
  # Note: variable name must be in quotes
  return(c(rep(NA, k-1), with(data, rollmedian(get(var), k=k))))
}
med.fun("Value", dataframe, 5)
# [1] NA NA NA NA 15.0 15.6 15.7 15.7
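An alternative to padding by hand, sketched under the assumption that a right-aligned window (current value plus the previous k - 1 values) is what is wanted: rollmedian() accepts fill and align arguments, so it can return a vector of the same length as its input. The vector names Value and medianval below are only illustrative:

```r
library(zoo)

# The Value column from the question
Value <- c(14.8, 14.9, 15.8, 15.0, 15.7, 15.6, 16.1, 16.1)

# align = "right" ends each window at the current row;
# fill = NA pads the first k - 1 positions so the length matches the input
medianval <- rollmedian(Value, k = 3, fill = NA, align = "right")
medianval
# [1]   NA   NA 14.9 15.0 15.7 15.6 15.7 16.1
```

Because the result has 8 elements, it can be assigned directly as a new column of the 8-row data frame. Note that rollmedian() requires an odd k.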