Reshaping issues in R: my reshaped dataframe changes 3 variables into 1

I'm a relative newbie to R, and I'm having problems reshaping my data from wide format to long format. I suspect the problem comes from how I built the data.frame: I created it in R by taking mean values from a larger data.frame and storing them in a new one.
First I created an empty data.frame (ndf):
ndf <- data.frame(matrix(ncol = 0, nrow = 3))
Then used lapply to get the means from the large data.frame (ldf) into separate columns in the new data.frame, with the year being used from the large data.frame:
ndf$Year <- names(ldf)
ndf$col1 <- lapply(ldf, function(i) {mean(i$col1)})
ndf$col2 <- lapply(ldf, function(i) {mean(i$col2)})
etc.
The melt function in reshape2 does not work, apparently because there are non-atomic 'measure' columns.
For using the reshape base function I have used the code:
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:7]),
v.names = "cover",
timevar = "species",
times = names(ndf[2:7]),
new.row.names = 1:1000,
direction = "long")
The output then essentially repeats the first row's values for every year. My wide data.frame looks like this (sorry for the strange names):
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
And the long data.frame looks like this:
Year species cover id
1 2014 Cladonia.portentosa 11.75 1
2 2015 Cladonia.portentosa 11.75 2
3 2016 Cladonia.portentosa 11.75 3
4 2014 Erica.tetralix 35.00 1
5 2015 Erica.tetralix 35.00 2
6 2016 Erica.tetralix 35.00 3
The "cover" column should instead hold, for each species, the value that belongs to the year in that row.
Please could someone tell me where I've gone wrong!?
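The "non-atomic 'measure' columns" message suggests that the columns built with lapply() are list columns, which is a likely cause of the odd reshape output. A minimal sketch of the difference, assuming ldf is a named list of per-year data frames (the ldf below is a hypothetical stand-in, not the asker's data):

```r
# Hypothetical stand-in for ldf: one small data frame per year
ldf <- list(
  "2014" = data.frame(col1 = c(10, 13.5), col2 = c(30, 40)),
  "2015" = data.frame(col1 = c(15, 16.5), col2 = c(25, 26.5)),
  "2016" = data.frame(col1 = c(22, 23.5), col2 = c(4, 6))
)

ndf <- data.frame(Year = names(ldf))

# lapply() returns a list, so this creates a list (non-atomic) column
ndf$col1_list <- lapply(ldf, function(d) mean(d$col1))

# vapply() returns a plain numeric vector, so these columns are atomic
ndf$col1 <- vapply(ldf, function(d) mean(d$col1), numeric(1))
ndf$col2 <- vapply(ldf, function(d) mean(d$col2), numeric(1))

is.list(ndf$col1_list)  # TRUE
is.numeric(ndf$col1)    # TRUE
```

With atomic columns like col1 and col2, melt() and reshape() have ordinary numeric vectors to work with.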

Here is an example of 'melting' in tidyr.
You'll need tidyr but I also like dplyr and am including it here to encourage its use along with the rest of the tidyverse. You'll find endless great tutorials on the web...
library(dplyr)
library(tidyr)
Let's use iris as an example. I want a long form where species, variable and value are the columns.
data(iris)
Here it is with gather(). We specify that variable and value are the names of the new 'melted' columns. We also specify that we do not want to melt the Species column, which should remain its own column.
iris_long <- iris %>%
  gather(variable, value, -Species)
Inspect the iris_long object to make sure it worked.
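In recent tidyr releases gather() is marked as superseded in favour of pivot_longer(). A sketch of the same reshape, assuming tidyr 1.0.0 or later is installed:

```r
library(tidyr)

# Same reshape as the gather() call: keep Species, stack every other column
iris_long <- pivot_longer(
  iris,
  cols      = -Species,
  names_to  = "variable",
  values_to = "value"
)

head(iris_long)
```

The row order differs from gather() (pivot_longer() goes row by row rather than column by column), but the set of Species/variable/value triples is the same.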

In addition to roman's answer, I thought I would share exactly what I did with my data set.
My initial "wide" data.frame ndf looked like this:
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
I installed tidyr:
install.packages("tidyr")
Then loaded the package:
library(tidyr)
I then used the gather() function from tidyr to gather the species columns Cladonia.portentosa, Erica.tetralix and Eriophorum.vaginatum into a single species column, with the values in a cover column in the new "long" data.frame.
long.ndf <- ndf %>% gather(species, cover, Cladonia.portentosa:Eriophorum.vaginatum)
Easy peasy!
Thanks again to roman for the suggestion!

I'm answering your question in case it helps someone else using the reshape function.
"Please could someone tell me where I've gone wrong!?"
You have not specified the idvar parameter, so reshape has created one for you, named id. To avoid that, add the line idvar = "Year" to your code:
ndf <- read.table(text =
"Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5",
header = TRUE, stringsAsFactors = FALSE)
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:4]),
v.names = "cover",
idvar = "Year",
timevar = "species",
times = names(ndf[2:4]),
new.row.names = 1:9,
direction = "long")
The result is what you were expecting:
reshape.ndf
Year species cover
1 2014 Cladonia.portentosa 11.75
2 2015 Cladonia.portentosa 15.75
3 2016 Cladonia.portentosa 22.75
4 2014 Erica.tetralix 35.00
5 2015 Erica.tetralix 25.75
6 2016 Erica.tetralix 5.00
7 2014 Eriophorum.vaginatum 55.00
8 2015 Eriophorum.vaginatum 70.00
9 2016 Eriophorum.vaginatum 37.50

Related

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
I manufactured a data set and used dplyr to build a solution. It is best practice to include code that generates a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(dplyr::n())) # generate one random groundwater value per sampled row
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
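The same idea can be sketched directly on data shaped like the question's table, using tidyr::complete() to fill in missing months before taking the six-month difference. The three-row df below is a hypothetical sample, and the sketch assumes month is a factor whose levels are in calendar order (complete() expands a factor to all of its levels):

```r
library(dplyr)
library(tidyr)

# Hypothetical sample in the shape of the question's data
df <- tibble(
  Well  = 222,
  year  = 1995,
  month = factor(c("February", "March", "August"), levels = month.name),
  value = c(8.53, 8.69, 9.66)
)

diffs <- df %>%
  complete(Well, year, month) %>%      # add NA rows for missing months
  arrange(Well, year, month) %>%       # calendar order within each well and year
  group_by(Well, year) %>%
  mutate(diff_6_months = value - lead(value, 6)) %>%  # Jan-Jul, Feb-Aug, ...
  ungroup() %>%
  filter(!is.na(diff_6_months))        # keep pairs where both months exist

diffs
```

Pairs where either month is missing come out as NA and are dropped by the final filter(); remove that step to keep them.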

Best method for averaging across rows [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I have data with multiple observations per day, and I want to construct a table of daily averages. My instinctive approach (from other programming languages) is to sort the data by date and write a for loop to go through and average it out. But every time I see an R question involving for loops, there tends to be a strong response that R handles vector-type approaches much better. What would a smarter approach be to this problem?
For reference, my data looks something like
date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8
And I would like the output to be a new data frame that looks like
date average
2017-4-4 146
2017-4-3 55
2017-4-2 8
library(dplyr)
df <- data.frame(date = c('2017-4-4', '2017-4-4', '2017-4-4', '2017-4-3', '2017-4-3', '2017-4-2'),
                 observation = c(17, 412, 9, 96, 14, 8))
df %>%
  group_by(date) %>%
  summarise(average = mean(observation)) %>%
  as.data.frame()
tapply() can do that:
df <- read.table(header=TRUE, text=
'date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8')
df$date <- as.Date(df$date, format="%Y-%m-%d")
m <- tapply(df$observation, df$date, FUN=mean)
d.result <- data.frame(date=as.Date(names(m), format="%Y-%m-%d"), m)
# > d.result
# date m
# 2017-04-02 2017-04-02 8
# 2017-04-03 2017-04-03 55
# 2017-04-04 2017-04-04 146
or
aggregate(observation ~ date, data=df, FUN=mean)
or with data.table
library("data.table")
dt <- fread(
'date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8')
dt[ , .(observation = mean(observation)), by=date]

R: Combine duplicate columns after dplyr join

When you use a dplyr join function like full_join, columns with identical names are duplicated and given suffixes like "col.x", "col.y", "col.x.x", etc. when they are not used to join the tables.
library(dplyr)
data1<-data.frame(
Code=c(2,1,18,5),
Country=c("Canada", "USA", "Brazil", "Iran"),
x=c(50,29,40,29))
data2<-data.frame(
Code=c(2,40,18),
Country=c("Canada","Japan","Brazil"),
y=c(22,30,94))
data3<-data.frame(
Code=c(25,14,52),
Country=c("China","Japan","Australia"),
z=c(22,30,94))
data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3))
This results in "Country", "Country.x", and "Country.y" columns.
Is there a way to combine the three columns into one, such that if a row has NA for a "Country", it takes the value from "Country.x" or "Country.y"?
I attempted a solution based on this similar question, but it gives me a warning and returns only values from the top three rows.
data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3)) %>%
mutate(Country=coalesce(Country.x,Country.y,Country)) %>%
select(-Country.x, -Country.y)
This returns the warning invalid factor level, NA generated.
Any ideas?
You could use my package safejoin, make a full join and deal with the conflicts using dplyr::coalesce.
First we'll have to rename the tables to have value columns named the same.
library(dplyr)
data1 <- rename_at(data1,3, ~"value")
data2 <- rename_at(data2,3, ~"value")
data3 <- rename_at(data3,3, ~"value")
Then we can join
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
data1 %>%
safe_full_join(data2, by = c("Code","Country"), conflict = coalesce) %>%
safe_full_join(data3, by = c("Code","Country"), conflict = coalesce)
# Code Country value
# 1 2 Canada 50
# 2 1 USA 29
# 3 18 Brazil 40
# 4 5 Iran 29
# 5 40 Japan 30
# 6 25 China 22
# 7 14 Japan 30
# 8 52 Australia 94
You get some warnings because you're joining factor columns with different levels; add the parameter check = "" to suppress them.
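If installing another package is not an option, the coalesce() attempt from the question can also work with dplyr alone once the Country columns are character rather than factor. A sketch under that assumption (stringsAsFactors = FALSE is set explicitly here; it is already the default from R 4.0.0):

```r
library(dplyr)

data1 <- data.frame(Code = c(2, 1, 18, 5),
                    Country = c("Canada", "USA", "Brazil", "Iran"),
                    x = c(50, 29, 40, 29), stringsAsFactors = FALSE)
data2 <- data.frame(Code = c(2, 40, 18),
                    Country = c("Canada", "Japan", "Brazil"),
                    y = c(22, 30, 94), stringsAsFactors = FALSE)
data3 <- data.frame(Code = c(25, 14, 52),
                    Country = c("China", "Japan", "Australia"),
                    z = c(22, 30, 94), stringsAsFactors = FALSE)

data4 <- Reduce(function(a, b) full_join(a, b, by = "Code"),
                list(data1, data2, data3)) %>%
  # take the first non-missing country name for each row
  mutate(Country = coalesce(Country.x, Country.y, Country)) %>%
  select(Code, Country, x, y, z)

data4
```

Unlike the safejoin result, this keeps x, y and z as separate columns.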

Combining two DFs in R and keeping only the rows which have a common date in it

I am fairly new to R but really have tried to find an answer to my problem, but was unsuccessful.
I have two data frames, "Brexit_final" and "Brexit_Google_Trends". Both have a "Date" column, but the Brexit_Final frame has fewer dates than the other one. I want to make a new data set that keeps only the rows whose date appears in both frames.
And in the process I also want to delete a lot of the columns.
Brexit_Final
Date Remain Leave Undecided Total_Difference
2016-06-18 42 44 13 7.5
2016-06-20 47.25 46 5.25 15
2016-06-23 55 45 0 14
Brexit_Google_Trends
Date EU Referendum Brexit Difference
2016-06-18 44 100 65 22
2016-06-19 23 100 62 55
2016-06-20 28 40 36 24
2016-06-21 37 55 43 36
2016-06-22 7 10 55 44
2016-06-23 67 100 62 103
Dream_Frame
Date Total_Difference Difference
2016-06-18 7.5 22
2016-06-20 15 24
2016-06-23 14 103
You can use an inner_join from the dplyr package.
inner_join(Brexit_Final, Brexit_Google_Trends, by = "Date") %>% select(Total_Difference, Difference)
From this canonical question, we get:
Dream_Frame <- merge(Brexit_Final, Brexit_Google_Trends, by = "Date")
Dream_Frame <- Dream_Frame[, c("Date", "Total_Difference", "Difference")]
Or, to do it in one step,
Dream_Frame <- merge(Brexit_Final[, c("Date", "Total_Difference")],
Brexit_Google_Trends[, c("Date", "Difference")],
by = "Date")
Brexit_Final = Brexit_Final[,c("Date","Total_Difference")]
Brexit_Google_Trends = Brexit_Google_Trends[,c("Date","Difference")]
Dream = merge(Brexit_Final, Brexit_Google_Trends,by="Date")
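To try the merge() approach end to end, here is a self-contained sketch that rebuilds the two tables from the question with read.table() (Date is left as character here, which is an assumption about the real data) and produces the desired frame:

```r
# Rebuild the question's tables
Brexit_Final <- read.table(header = TRUE, text = "Date Remain Leave Undecided Total_Difference
2016-06-18 42 44 13 7.5
2016-06-20 47.25 46 5.25 15
2016-06-23 55 45 0 14")

Brexit_Google_Trends <- read.table(header = TRUE, text = "Date EU Referendum Brexit Difference
2016-06-18 44 100 65 22
2016-06-19 23 100 62 55
2016-06-20 28 40 36 24
2016-06-21 37 55 43 36
2016-06-22 7 10 55 44
2016-06-23 67 100 62 103")

# merge() keeps only dates present in both frames (an inner join by default)
Dream_Frame <- merge(Brexit_Final[, c("Date", "Total_Difference")],
                     Brexit_Google_Trends[, c("Date", "Difference")],
                     by = "Date")
Dream_Frame
```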
Used the suggestion from "student"
inner_join(Brexit_Final, Brexit_Google_Trends, by = "Date") %>% select(Date, Total_Difference, Difference)
With the slight addition of adding in the "Date" as a column to keep.
In case anybody else is struggling with this: a problem in my data was that the "Difference" and "Total_Difference" columns were not numeric but were themselves data frames that I had attached to the others. So I used:
Brexit_final$Total_Difference <- as.numeric(Brexit_final$Total_Difference[[1]])
And the same for "Difference" to make them numeric first. Then all the provided solutions worked.
Thanks for your help #all

converting a dataframe in given format

The given data frame is:
Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9
I want to convert it into:
Year A F
2010 17 8
2011 18 9
Is there a simple way to do this?
library('reshape2')
df <- read.table(text=" Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9", header = TRUE)
dfc <- dcast(df, year ~ Group, value.var = "Value")
Although the syntax can be confusing, I still find reshape in base R useful to know. Using the df provided by gauden:
reshape_df <- reshape(df, direction = "wide", idvar = "year", timevar = "Group")
colnames(reshape_df) <- c("year", "A", "F")
This converts the data from "long" format to "wide". Usually the time variable supplies the new column names; here we want columns "A" and "F", so timevar is set to "Group".
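For completeness, the same long-to-wide conversion can be sketched with tidyr's pivot_wider(), assuming tidyr 1.0.0 or later is installed:

```r
library(tidyr)

df <- read.table(text = "Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9", header = TRUE)

# One row per year, one column per Group, filled from Value
wide <- pivot_wider(df, names_from = Group, values_from = Value)
wide
```

Here the new column names come straight from the Group values, so no renaming step is needed afterwards.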
