Reshaping a data frame from wide to long in R [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 7 years ago.
I have the following data frame with temperature and pressure data from 3 sensors:
df <- data.frame(
Test = 1:10,
temperature_sensor1=rnorm(10,25,5),
temperature_sensor2 = rnorm(10,25,5),
temperature_sensor1 = rnorm(10,25,5),
pressure_sensor1 = rnorm(10,10,2),
pressure_sensor2 = rnorm(10,10,2),
pressure_sensor3 = rnorm(10,10,2))
How can I reshape it into long format, such that each row has the temperature and pressure data for a single sensor? The result should have these columns:
Test Sensor Temperature Pressure
Thanks!

Here are a couple of approaches:
1) dplyr/tidyr Convert df to long form using gather, then use separate to split the generated variable column at the underscore into two columns. Finally, convert from long to wide with spread, using the variable column (which contains the strings pressure and temperature) as the key and the value column (which contains the numbers) as the values:
library(dplyr)
library(tidyr)
df %>%
gather("variable", "value", -Test) %>%
separate(variable, c("variable", "sensor"), sep = "_") %>%
spread(variable, value)
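In more recent versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. As a hedged sketch (assuming tidyr 1.0.0 or later, and assuming the third temperature column is meant to be temperature_sensor3), the three steps above can be collapsed into one call. The special ".value" entry tells pivot_longer to turn the part of each name before the underscore into output columns:

```r
library(tidyr)

# Example input; the seed and the name temperature_sensor3 are assumptions
set.seed(123)
df <- data.frame(
  Test = 1:10,
  temperature_sensor1 = rnorm(10, 25, 5),
  temperature_sensor2 = rnorm(10, 25, 5),
  temperature_sensor3 = rnorm(10, 25, 5),
  pressure_sensor1 = rnorm(10, 10, 2),
  pressure_sensor2 = rnorm(10, 10, 2),
  pressure_sensor3 = rnorm(10, 10, 2)
)

# Split each column name at "_": the first part becomes a value column
# (temperature / pressure), the second part goes into the sensor column
long <- pivot_longer(
  df,
  cols = -Test,
  names_to = c(".value", "sensor"),
  names_sep = "_"
)

head(long)
```

The result has one row per Test and sensor, with temperature and pressure side by side.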
2) reshape We can also use reshape from base R, so no packages are needed. The line marked optional resets the row names; it can be omitted if they do not matter.
unames <- grep("_", names(df), value = TRUE)
varying <- split(unames, sub("_.*", "", unames))
sensors <- unique(sub(".*_", "", unames))
long <- reshape(df, dir = "long", varying = varying, v.names = names(varying),
times = sensors, timevar = "sensor")
rownames(long) <- NULL # optional
If df has fixed columns then we could simplify the above a bit by hard coding varying and sensors using these definitions in place of the more complex but general code above:
varying <- list(pressure = 5:7, temperature = 2:4) # column positions in df
sensors <- c("sensor1", "sensor2", "sensor3")
Note: Because df is built from random numbers, we set the seed first so that it can be created reproducibly. Also, the question used temperature_sensor1 for two columns; we assumed that the second occurrence was intended to be temperature_sensor3. The input used here is:
set.seed(123)
df <- data.frame(
Test = 1:10,
temperature_sensor1=rnorm(10,25,5),
temperature_sensor2 = rnorm(10,25,5),
temperature_sensor3 = rnorm(10,25,5),
pressure_sensor1 = rnorm(10,10,2),
pressure_sensor2 = rnorm(10,10,2),
pressure_sensor3 = rnorm(10,10,2))

Related

How to manage column which doesn't exist while generating bar graph using pipe operator?

I have many data frames like school_skill_score_ff, each with three columns: 'Skill_full_form', '2020' and '2021'. I'm trying to create a pipeline that draws a bar graph. My code is here.
school_bar<-school_skill_score_ff%>%
e_chart(Skill_full_form,backgroundColor='#0d1117',center=c('50%','35%'))%>%
e_bar(`2021`,label=list(show=TRUE,color='#fff',position='top'))%>%
e_bar(`2020`,label=list(show=TRUE,color='#fff',position='top'))%>%
#e_tooltip(trigger = "axis")%>%
e_tooltip(trigger = "axis",axisPointer=list(type='shadow'))%>%
#e_title("Skills Score") %>%
e_theme("westeros")%>%
e_toolbox_feature("dataZoom")%>%
e_animation(duration = 2000)%>%
e_hide_grid_lines('x')%>%
e_y_axis(splitLine=list(lineStyle=list(color='#0f375f')),axisLabel=list(fontSize=10,color='#fff',fontWeight='normal'),name='Skill Score',nameLocation='middle',nameGap=30,nameColor='#fff')%>%
e_x_axis(axisLabel=list(rotate=16,fontSize=10,color='#fff',fontWeight='normal'))%>%
#e_axis_labels(x='Skill Name', y = "Skill Score")%>%
e_y_axis(name='Skill Score',nameLocation='middle',nameGap=38,splitLine=list(lineStyle=list(color='#0f375f')))%>%
#e_title("School Skill Score",top='0%',left='45%',textStyle=list(color='#fff'))%>%
e_legend(top='8%',left='46%',textStyle=list(color='#fff'))%>%
#e_title("School Skill Graph",textStyle=list(color='#fff',fontWeight='normal'),left='center',top='5%') %>%
#e_x_axis(axisLabel=list(rotate=17,fontSize=12))%>%
e_grid(show=TRUE,top='4%',width='90%',left='5%')#,name='Skill Names',nameGap=60,nameLocation='middle')
My issue is that I'm reading the data directly from Excel files, and many of the files don't have the column '2020', which generates an error. I want to handle this automatically so that the graph for at least one column appears when the second column is absent. I'm using the echarts4r library.
One approach to achieve your desired result would be to reshape your data to long format, which adds a year column, and to group the data by year before passing it to e_charts. This way any number of years is handled automatically.
Using some toy example data this approach works fine for two years ...
library(echarts4r)
library(dplyr)
library(tidyr)
# Example data
school_skill_score_ff <- data.frame(
LETTERS[1:10],
1:10,
11:20
)
names(school_skill_score_ff) <- c("Skill_full_form", "2020","2021")
school_skill_score_ff_long <- school_skill_score_ff %>%
tidyr::pivot_longer(-Skill_full_form, names_to = "year", values_to = "score")
school_skill_score_ff_long %>%
group_by(year) %>%
e_chart(Skill_full_form,backgroundColor='#0d1117',center=c('50%','35%'))%>%
e_bar(score, label=list(show=TRUE,color='#fff',position='top'))
as well as for only one year present in the data:
# Now remove column 2020
school_skill_score_ff <- select(school_skill_score_ff, -`2020`)
school_skill_score_ff_long1 <- school_skill_score_ff %>%
tidyr::pivot_longer(-Skill_full_form, names_to = "year", values_to = "score")
school_skill_score_ff_long1 %>%
group_by(year) %>%
e_chart(Skill_full_form,backgroundColor='#0d1117',center=c('50%','35%'))%>%
e_bar(score, label=list(show=TRUE,color='#fff',position='top'))

Convert data frame columns to vectors

I have a data frame named "Continents_tmap" from which I want to extract 3 vectors, as in the following examples. Note: "Values" should hold the Cases column of the data frame.
labels = c("France","Germany","India", etc)
Parent = c("Europe","Europe","Asia",etc)
Values = c(100,345,456,etc)
My current code is as follows.
covid_1 <-
read.csv("C:/Users/Owner/Downloads/COVID-19 Activity.csv", stringsAsFactors = FALSE)
df1 <-
select(
covid_1,
REPORT_DATE,
COUNTRY_SHORT_NAME,
COUNTRY_ALPHA_3_CODE,
PEOPLE_DEATH_NEW_COUNT,
PEOPLE_POSITIVE_NEW_CASES_COUNT,
PEOPLE_DEATH_COUNT,
PEOPLE_POSITIVE_CASES_COUNT,
CONTINENT_NAME
)
Continents_tmap <- df1 %>%
group_by(Continent,Country.x) %>%
summarise(Deaths = sum(Deaths), Cases = sum(Positive_Cases))
Continents_tmap<- data.frame(Continents_tmap)
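One possible way to get the three vectors is to extract each column with $. The sketch below uses a small made-up data frame with the same column names as the summarise() call above (Continent, Country.x, Deaths, Cases); the values are illustrative only, not real data:

```r
# Hypothetical stand-in for Continents_tmap; the values are made up
Continents_tmap <- data.frame(
  Continent = c("Europe", "Europe", "Asia"),
  Country.x = c("France", "Germany", "India"),
  Deaths = c(1, 2, 3),
  Cases = c(100, 345, 456),
  stringsAsFactors = FALSE
)

# `$` (or `[[`) returns a single column as a plain vector
labels <- Continents_tmap$Country.x
Parent <- Continents_tmap$Continent
Values <- Continents_tmap$Cases
```

Because the three vectors come from the same rows, they stay aligned: the i-th label belongs to the i-th parent and the i-th value.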

Making new variables for every group of observation in R

I have 11 variables in my data frame. The first is a unique identifier of the observation (a plane). The second is a number from 1 to 21 representing the flight of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every flight number, e.g. time_1, time_2, ..., velocity_1, velocity_2, etc., and consequently reduce the number of observations (the repeated ones).
I don't really know how to start. I was thinking about a mutate call like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing and a new problem may appear, perhaps.
Basically, you want to convert long to wide data for each variable. You can lapply over these with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(rep("A", 3), rep("B", 3)),
n_flight = rep(seq(3), 2),
time = seq(19, 24),
velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
setdiff(names(df), c("ID", "n_flight")), function(x) {
df %>%
select(ID, n_flight, !!x) %>%
tidyr::spread(., key = "n_flight", value = x) %>%
setNames(paste(x, names(.), sep = "_"))
}
) %>%
bind_cols()
Let me know if this wasn't what you were going for.
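With tidyr 1.0.0 or later, pivot_wider can spread several value columns at once, which avoids both the lapply and the duplicated ID columns. A minimal sketch on the same example data (offered as an alternative, not as a description of the code above):

```r
library(tidyr)

df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)

# One row per ID; each value column is suffixed with the flight number,
# giving time_1, time_2, time_3, velocity_1, velocity_2, velocity_3
wide <- pivot_wider(
  df,
  names_from = n_flight,
  values_from = c(time, velocity)
)

wide
```

Any further measurement columns (distance and so on) can be added to values_from in the same way.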

R convert df from wide to long by splitting column names [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert the df_original data.frame below to the form in df_goal. The column names from the original data.frame should be split, with the latter part acting as a key and the first part kept as a variable name. Preferably I would like to use a tidyverse solution, but I am open to any approach. Thank you very much!
df_original <-
data.frame(id = c(1,2,3),
variable1_partyx = c(4,5,6),
variable1_partyy = c(14,15,16),
variable2_partyx = c(24,25,26),
variable2_partyy = c(34,35,36))
df_goal <-
data.frame(id = c(1,1,2,2,3,3),
key = c("partyx","partyy","partyx","partyy","partyx","partyy"),
variable1 = c(4,14,5,15,6,16),
variable2 = c(24,34,25,35,26,36))
library(tidyr) # also makes the %>% pipe available
df_original %>%
tidyr::gather(key, value, -id) %>%
tidyr::separate(key, into = c("var", "key"), sep = "_") %>%
tidyr::spread(var, value)

How to work around an error while reshaping a data frame with spread()

I am trying to transform a long data frame into a wide one with flagged cases. I pivot it, using a temporary vector that serves as a flag. It works perfectly on small data sets (see the example, which you can copy and paste into RStudio), but when I try it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set is needed for further processing.
Is there any workaround for this problem? I bet a lot of people cleaning data run into the same problem.
Please help me.
Here is the code:
The first chunk, "example", makes a small data set that shows how the result is supposed to look.
The second chunk, "real data", is a sliced portion of the churn data set.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1") ,
flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors" , value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
spread can only put one value in the 'cell' where a spread 'key' meets the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries, and the answer depends on what you are trying to do. I provide one possible solution below: instead of making a dummy 'temporary' variable, I count the number of episodes and use that count as the value. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
group_by(state, account_length, area_code, international_plan) %>%
tally %>%
dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
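To see why counting removes the error, here is a small sketch on made-up rows (the state and area_code values are illustrative, not taken from the churn data). Rows 1 and 2 share the same identifiers, which is exactly the situation spread complains about; count() collapses them into one row before reshaping. pivot_wider is used here as the newer equivalent of spread, assuming tidyr 1.0.0 or later:

```r
library(dplyr)
library(tidyr)

# Made-up data: rows 1 and 2 are duplicates of each other
toy <- data.frame(
  state = c("KS", "KS", "OH", "NJ"),
  area_code = c("area_code_415", "area_code_415", "area_code_415", "area_code_408"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  # one row per (area_code, state) pair, with the number of occurrences
  count(area_code, state, name = "temporary") %>%
  # states become columns; pairs that never occur are filled with 0
  pivot_wider(
    names_from = state,
    values_from = temporary,
    values_fill = list(temporary = 0L)
  )

wide
```

Each cell now holds how often that combination occurred, so a duplicated pair shows up as a count of 2 rather than as an error.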
As others point out, spread needs rows that are uniquely identified. My solution uses base R:
library(C50)
f<- function(df, key){
if (sum(names(df)==key)==0) stop("No such key");
u <- unique(df[[key]])
id <- matrix(0,dim(df)[1],length(u))
uu <- lapply(df[[key]],function(x)which(u==x)) ## check 43697442 for details
for(i in 1:dim(df)[1]) id[i,uu[[i]]] <- 1
colnames(id) = as.character(u)
return(cbind(df,id));
}
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1"))
f(df, key='fact') # stops with the error "No such key"
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although the f function contains a for loop and some temporary variables, it is not slow in practice.
