Transforming a big dataframe into a more sensible form [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Reshaping wide to long with multiple values columns [duplicate]
(5 answers)
Closed 1 year ago.
The dataframe consists of 3 columns: wine_id, taste_group, and the evaluated matching score for each of those groups:
wine_id  taste_group     score
22       tree_fruit      87
22       citrus_fruit    98
22       tropical_fruit  17
22       earth           8
22       microbio        6
22       oak             7
22       vegetal         1
How can I make a separate column for each taste_group and list the scores in rows?
That is, I want this:
wine_id  tree_fruit  citrus_fruit  tropical_fruit  earth  microbio  oak  vegetal
22       87          98            17              8      6         7    1
There are 13 taste groups overall and more than 6000 wines.
If a wine doesn't have a score for a taste_group, the value should be 0.
I used
length(unique(tastes$Group))
length(unique(tastes$Wine_Id))
in R to check these basic counts.
How do I get to the wanted format?

Assuming your dataframe is named tastes, you'll want something like:
library(tidyr)
tastes %>%
  # get into the desired wide format
  pivot_wider(names_from = taste_group, values_from = score, values_fill = 0)

In R this is called long-to-wide reshaping; you can also use data.table::dcast to do it.
library(data.table)
dt <- fread("
wine_id taste_group score
22 tree_fruit 87
22 citrus_fruit 98
22 tropical_fruit 17
22 earth 8
22 microbio 6
22 oak 7
22 vegetal 1
")
dcast(dt, wine_id ~ taste_group, value.var = "score")
#wine_id citrus_fruit earth microbio oak tree_fruit tropical_fruit vegetal
# <int> <int> <int> <int> <int> <int> <int> <int>
# 22 98 8 6 7 87 17 1
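The question also asks for a 0 wherever a wine has no score for a taste group; dcast's fill argument supplies that default once the full data (13 groups, 6000+ wines) is used. A minimal sketch on the same dt:
# missing wine_id/taste_group combinations are filled with 0 instead of NA
dcast(dt, wine_id ~ taste_group, value.var = "score", fill = 0)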

Consider reshape:
wide_df <- reshape(
  my_data,
  timevar = "taste_group",
  v.names = "score",
  idvar = "wine_id",
  direction = "wide"
)
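Note that reshape() names the spread columns score.tree_fruit, score.citrus_fruit, and so on, and leaves NA where a wine has no score for a group. A small follow-up sketch, assuming wide_df from above, to strip the prefix and fill with 0 as the question asks:
# drop the "score." prefix added by reshape()
names(wide_df) <- sub("^score\\.", "", names(wide_df))
# replace missing scores with 0
wide_df[is.na(wide_df)] <- 0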

Related

How to find duplicate dates within a row in R, and then replace associated values with the mean?

There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.
Thank you!!
This is easier to manage in long (and tidy) format.
Using the tidyverse, you can use pivot_longer to put the data into long form. After grouping by ID and Gest, you can replace the Sys and Dia values with their mean; if there is more than one measurement with the same Gest for a given ID, this uses the average.
Then you can keep one row per ID/Gest combination with slice. After regrouping by ID, you can renumber the measurements now that those with common Gest values have been combined.
library(tidyverse)
df.1 %>%
  pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
  group_by(ID, Gest) %>%
  mutate(across(c(Sys, Dia), mean)) %>%
  slice(1) %>%
  group_by(ID) %>%
  mutate(number = row_number())
Output
ID number Gest Sys Dia
<dbl> <int> <dbl> <dbl> <dbl>
1 27 1 27 120 90
2 27 2 29 122 89
3 27 3 32 123 90
4 27 4 33 124 94
5 46 1 28 126. 83.5
6 46 2 29 122 88
7 46 3 30 123 89
8 72 1 29 124 92
9 72 2 30 119 84.5
10 72 3 32 128 80
Note - I would keep in long form - but if you wanted wide again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
This involves changing the structure of the table into long format, averaging the duplicates, and then reshaping back into the desired table:
library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)
#convert data to long format
longdf <- df.1 %>%
  pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d)")
#average duplicate rows
temp <- longdf %>%
  group_by(ID, Gest) %>%
  summarize(Sys = mean(Sys), Dia = mean(Dia)) %>%
  mutate(time = row_number())
#convert back to wide format
answer <- temp %>%
  pivot_wider(id_cols = ID, names_from = time, values_from = c("Gest", "Sys", "Dia"),
              names_glue = "{.value}{time}")
#resort the columns
answer <- answer[, names(df.1)]
answer
# A tibble: 3 × 13
# Groups: ID [3]
ID Gest1 Sys1 Dia1 Gest2 Sys2 Dia2 Gest3 Sys3 Dia3 Gest4 Sys4 Dia4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 27 27 120 90 29 122 89 32 123 90 33 124 94
2 46 28 126. 83.5 29 122 88 30 123 89 NA NA NA
3 72 29 124 92 30 119 84.5 32 128 80 NA NA NA

Convert data frame with year column to timeseries data, sorted by observation [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 8 months ago.
I have the following data.frame, which I want to convert into 2 separate timeseries data frames for revenue and cost.
df1 = data.frame(year = c('2018', '2019', '2020', '2019', '2020', '2021'),
                 company = c('x', 'x', 'x', 'y', 'y', 'z'),
                 revenue = c(45, 78, 13, 89, 48, 70),
                 cost = c(100, 120, 130, 140, 160, 164),
                 stringsAsFactors = FALSE)
df1
year company revenue cost
1 2018 x 45 100
2 2019 x 78 120
3 2020 x 13 130
4 2019 y 89 140
5 2020 y 48 160
6 2021 z 70 164
If I want to create a new data frame for the revenue data, with the data arranged as below and n.a. replacing all years in which the data is not available, what code can I use to do this?
  company 2018 2019 2020 2021
1       x   45   78   13 n.a.
2       y n.a.   89   48 n.a.
3       z n.a. n.a. n.a.   70
With the tidyverse...
df1 %>% filter(company == 'x') %>% select(-cost) %>% pivot_wider(values_from = revenue, names_from = year)
If you are trying to get both revenue and cost, as you imply:
library(tidyr)
df2 <- pivot_wider(df1, names_from = year, values_from = c(revenue,cost))
gets what you need, I think. Cols 2-5 are the revenues and Cols 6-9 are the costs.
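If you do want revenue and cost as two separate data frames, as the question describes, a minimal sketch (assuming df1 as defined above) is to drop the other measure before pivoting:
library(tidyr)
# revenue per company across years; years with no data become NA
rev_wide <- pivot_wider(df1[, c("company", "year", "revenue")],
                        names_from = year, values_from = revenue)
# same shape for cost
cost_wide <- pivot_wider(df1[, c("company", "year", "cost")],
                         names_from = year, values_from = cost)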

Find lowest value in three columns in R [duplicate]

This question already has answers here:
min for each row in a data frame
(4 answers)
Closed 2 years ago.
I have a dataframe with 400 people who each have three predicted values (so 400 rows, 3 columns). Now I need a function that writes the lowest of these three values into a new variable, so that every person has the best prediction in a fourth column. I can't find a way to do this, so I would be very thankful for your help!
Imagine you had 3 columns named Score1, Score2, and Score3. You might use apply as follows:
data$MinScore <- apply(data[,c("Score1","Score2","Score3")],1,min)
head(data)
Person Score1 Score2 Score3 MinScore
1 Person1 11 90 73 11
2 Person2 60 85 76 60
3 Person3 20 16 36 16
4 Person4 95 87 66 66
5 Person5 99 81 20 20
6 Person6 42 79 80 42
Sample Data
data <- data.frame(Person = paste0("Person", 1:400),
                   Score1 = sample(1:100, 400, replace = TRUE),
                   Score2 = sample(1:100, 400, replace = TRUE),
                   Score3 = sample(1:100, 400, replace = TRUE))
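Base R's pmin() is a vectorized alternative to the apply() call above; a small sketch using the same column names:
# row-wise minimum across the three score columns
data$MinScore <- pmin(data$Score1, data$Score2, data$Score3)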

How to merge rows in a data frame when the values in a column are the same in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have a number of observations from the same unit, and I need to merge the rows. So a data frame like
data.frame(
fir =c("001","006","001", "006", "062"),
sec = c(10,5,6,7,8),
thd = c(45,67,84,54,23))
fir sec thd
001 10 45
006 5 67
001 6 84
006 7 54
062 8 23
The first column has a 3 digit number representing a unit. I need to add the rows together to get a total for each unit. The other columns are numeric values that need adding together. So the dataframe would look like,
fir sec thd
001 16 129
006 12 121
062 8 23
I need it to work for any unique number in the first column.
Any ideas? Thank you for any help!
Welcome! This is a classic case of a group-by operation: we group by fir and, within each group, take the sum of the sec and thd columns.
library(tidyverse)
df <- data.frame(
  fir = c("001", "006", "001", "006", "062"),
  sec = c(10, 5, 6, 7, 8),
  thd = c(45, 67, 84, 54, 23))
df %>%
  group_by(fir) %>%
  summarise(sec_sum = sum(sec),
            thd_sum = sum(thd))
We can do a group by 'sum'
library(dplyr)
df1 %>%
  group_by(fir) %>%
  summarise_all(sum)
# A tibble: 3 x 3
# fir sec thd
# <fct> <dbl> <dbl>
#1 001 16 129
#2 006 12 121
#3 062 8 23
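summarise_all() is superseded in current dplyr; an equivalent sketch with across():
df1 %>%
  group_by(fir) %>%
  summarise(across(everything(), sum))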
Or with aggregate from base R
aggregate(. ~ fir, df1, sum)
data
df1 <- data.frame(
fir =c("001","006","001", "006", "062"),
sec = c(10,5,6,7,8),
thd = c(45,67,84,54,23))

How to assign a value depending on two conditions including column names. (add environmental variable to tracking data)

I have a data frame (track) with the position (longitude/latitude) and date (number of the day in the year) of tracking points for different animals, and another data frame (Var) which gives the mean temperature for every day of the year at different locations.
I would like to add a new column TEMP to my data frame (track), where the value comes from (Var) and corresponds to the date and GPS location of each tracking point in (track).
Here is a really simple subset of my data and what I would like to obtain.
track = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5))
Var = data.frame(
Longitude=c(117,117,116,116),
Latitude=c(18,20,18,20),
Day1=c(22,23,24,21),
Day2=c(21,28,27,29),
Day3=c(12,13,14,11),
Day4=c(17,19,20,23),
Day5=c(32,33,34,31)
)
TrackPlusVar = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5),
Temp= c(22,11,19,22,31)
)
I have no idea how to look up the value for the matching date and GPS location, since the day is stored as a column name in Var. Any ideas would be very useful!
This is a dplyr and tidyr approach.
library(dplyr)
library(tidyr)
# reshape table Var
Var %>%
  gather(Day, Temp, -Longitude, -Latitude) %>%
  mutate(Day = as.numeric(gsub("Day", "", Day))) -> Var2
# join tables
track %>% left_join(Var2, by=c("Longitude", "Latitude", "Day"))
# animals Longitude Latitude Day Temp
# 1 1 117 18 1 22
# 2 1 116 20 3 11
# 3 1 117 20 4 19
# 4 2 117 18 1 22
# 5 2 116 20 5 31
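gather() is retired in current tidyr; a sketch of the same reshape with pivot_longer(), using the same Var as above:
# equivalent long reshape with pivot_longer (tidyr >= 1.0)
Var %>%
  pivot_longer(cols = starts_with("Day"), names_to = "Day",
               names_prefix = "Day", values_to = "Temp") %>%
  mutate(Day = as.numeric(Day)) -> Var2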
If the process that creates your tables makes sure that all your cases belong to both tables, then you can use inner_join instead of left_join to make the process faster.
If you're still not happy with the speed you can use a data.table join process to check if it is faster, like:
library(data.table)
Var2 = setDT(Var2, key = c("Longitude", "Latitude", "Day"))
track = setDT(track, key = c("Longitude", "Latitude", "Day"))
Var2[track][order(animals,Day)]
# Longitude Latitude Day Temp animals
# 1: 117 18 1 22 1
# 2: 116 20 3 11 1
# 3: 117 20 4 19 1
# 4: 117 18 1 22 2
# 5: 116 20 5 31 2
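A design note: instead of keying both tables with setDT(..., key = ...), you can pass the join columns ad hoc with on=, which avoids re-sorting the original tables; a minimal sketch, assuming Var2 and track are already data.tables:
# ad hoc join on the shared columns, keeping every tracking point
Var2[track, on = c("Longitude", "Latitude", "Day")][order(animals, Day)]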
