Mutate a dataframe by a vector which should match variable names


I have a dataframe with a column of years (year_vector) and several columns that contain the GDP-per-head values of different countries at specific points in time, one column per year. I want to mutate this dataframe to get a new variable that, for each row, contains the value from the year column named by that row's year_vector.
My data.frame looks like this:
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000),
)
How can I mutate this dataframe as explained above, for example into a new variable gdp_head?
EDIT: The output should look like this:
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000)) %>%
mutate(gdp_head =c(.$'1940'[1],.$'1940'[2],.$'1960'[3],
.$'1950'[4],.$'1940'[5],.$'1960'[6],
.$'1960'[7],.$'1950'[8] ))

Here is one approach:
First, since you are going to compare the year_vector column with column names (which will be character), you can convert year_vector to character as well:
dataset$year_vector <- as.character(dataset$year_vector)
You currently have a tibble, but if you convert it to a plain data.frame you can subset it with a [row, column] matrix and add the matched values as gdp_head:
dataset <- as.data.frame(dataset)
dataset$gdp_head <- as.numeric(dataset[cbind(1:nrow(dataset), match(dataset$year_vector, names(dataset)))])
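The matrix-indexing idea can be sketched on a small toy data frame. This is only an illustration under assumptions: the values are made up, and `year_cols`, `col_idx` and `values` are hypothetical helper names. It indexes a numeric matrix of just the year columns, so no conversion back from character is needed.

```r
# Toy data with made-up values (not the random data from the question)
df <- data.frame(
  country = c("Austria", "Germany", "Sweden"),
  year_vector = c(1950, 1940, 1960),
  `1940` = c(10, 20, 30),
  `1950` = c(11, 21, 31),
  `1960` = c(12, 22, 32),
  check.names = FALSE  # keep "1940" etc. as column names
)

year_cols <- c("1940", "1950", "1960")

# For each row, find which year column its year_vector points at
col_idx <- match(as.character(df$year_vector), year_cols)

# Index a numeric matrix with a two-column (row, column) matrix
values <- as.matrix(df[year_cols])
df$gdp_head <- values[cbind(seq_len(nrow(df)), col_idx)]

print(df$gdp_head)
```

Each row of the `cbind()` matrix is one (row, column) pair, so the result has one value per row of the data.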

I came up with the following solution, which works as well:
dataset %>%
do(.,mutate(.,gdp_head = pmap(list(1:nrow(.), year_vector),
function(x,y) .[x,(y-1901+16)]) %>%
unlist() ))
In this solution I compute the column index from the year: I subtract the first year from year_vector and add the column position of the first year variable. In my actual data the year variables start in 1901, whose column index is 16; for other data (such as the example above) these two numbers must be adjusted.
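The index arithmetic can be checked in isolation. A minimal sketch, assuming one column per consecutive year; `year_to_col` is a hypothetical helper, and 1901 and 16 are the values quoted above.

```r
first_year <- 1901  # first year that has its own column
first_col <- 16     # position of that column in the data frame

# Column position for a given year, assuming one column per consecutive year
year_to_col <- function(year) year - first_year + first_col

print(year_to_col(c(1901, 1902, 1950)))
```

This only holds when the year columns are contiguous and ordered; matching on column names avoids hard-coding the two numbers.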

Related

Dynamic mean of consecutive columns in dplyr

I have a data frame with a large number of columns containing numeric values.
I'd like to dynamically calculate the mean value of the two consecutive columns (so mean of column 1 and column 2, mean of column 3 and 4, mean of 5 and 6 etc...) and either store it into new column names or replace one of the two columns I used in the calculation.
I tried creating a function that calculate the mean of two columns and storing it into the first column then applying a loop to that function so it applies to my whole datatable.
However I'm struggling with mutate: since I dynamically generate the name of the column I use (they all start with "PUISSANCE" then a number) through a glue, it displays the value as a string into the mutate and doesn't evaluate it.
mean_col <- function(data, k)
{
n<-2*k+1
m<-2*k+2
varname_even <- paste("PUISSANCE", m,sep="")
varname_odd <- paste("PUISSANCE", n,sep="")
mutate(data, "{{varname_odd}}" := ({{varname_odd}}+{{varname_even}})/2) # here is the issue: the right-hand side is treated as non-numeric, since it is the sum of two strings
data
}
for (k in 0:24) {
my_data_set <- mean_col(my_data_set,k)
}
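For comparison, here is a minimal base R sketch of the same function that swaps mutate for `[[` indexing, so the pasted names are looked up as columns rather than treated as strings. The `toy` data frame and its values are assumptions for illustration.

```r
mean_col <- function(data, k) {
  varname_odd <- paste0("PUISSANCE", 2 * k + 1)
  varname_even <- paste0("PUISSANCE", 2 * k + 2)
  # [[ ]] looks the column up by its name held in the variable
  data[[varname_odd]] <- (data[[varname_odd]] + data[[varname_even]]) / 2
  data  # return the modified copy
}

# Toy data with made-up values
toy <- data.frame(PUISSANCE1 = c(1, 2), PUISSANCE2 = c(3, 4),
                  PUISSANCE3 = c(5, 6), PUISSANCE4 = c(7, 8))

for (k in 0:1) {
  toy <- mean_col(toy, k)
}
print(toy)
```

The function returns the modified data frame, and the loop assigns it back on each iteration.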

OK, just to let you know that I managed to solve it myself.
I used pivot_longer to put all the "PUISSANCEXX" names in one column and the associated values in another.
Then I used str_extract to pull the number XX out of the string "PUISSANCEXX" and converted it to numeric.
Dividing by 2 and subtracting 0.5 turns each consecutive pair of numbers into X and X.5, so floor maps both to X. Then I just did a group_by/summarise to get the mean of each pair, and that's it!
my_data_set %>%
pivot_longer(starts_with("PUISSANCE"),names_to = "heure", values_to = "puissance") %>%
mutate("time" = floor(as.numeric(str_extract(heure, "\\d+"))/2-0.5)) %>%
select(-heure) %>%
group_by(time) %>%
summarise("power" = mean(puissance))
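The pairing arithmetic is easy to check on its own. A small base R sketch, with made-up values and `tapply` standing in for the group_by/summarise step:

```r
col_numbers <- 1:6

# 1,2 -> 0; 3,4 -> 1; 5,6 -> 2
time <- floor(col_numbers / 2 - 0.5)
print(time)

# Made-up values, one per column number
values <- c(10, 20, 30, 40, 50, 60)

# Mean of each consecutive pair
pair_means <- tapply(values, time, mean)
print(pair_means)
```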

Changing dataframes in bulk? How to apply a list of operations to multiple dataframes?

So, I have 6 data frames, all look like this (with different values):
Now I want to create a new column in all the data frames for the country. Then I want to convert it into a long df. This is how I am going about it.
dlist<- list(child_mortality,fertility,income_capita,life_expectancy,population)
convertlong <- function(trial){
trial$country <- rownames(trial)
trial <- melt(trial)
colnames(trial)<- c("country","year",trial)
}
for(i in dlist){
convertlong(i)
}
After running this I get:
Using country as id variables
Error in names(x) <- value :
'names' attribute [5] must be the same length as the vector [3]
That's all; it doesn't perform the operations on the data frames. I am pretty sure I'm making a stupid mistake, but I looked on online forums and cannot figure it out.

Maybe you can replace
trial$country <- rownames(trial)
with
trial <- cbind(trial, rownames(trial))

Here's a tidyverse attempt -
library(tidyverse)
#Put the dataframes in a named list.
dlist<- dplyr::lst(child_mortality, fertility,
income_capita, life_expectancy,population)
#lst is not a typo!!
#Write a function which creates a new column from the row names
#and gets the data into long format.
#The column name for the 3rd column is passed separately (`col`).
convertlong <- function(trial, col){
trial %>%
rownames_to_column('country') %>%
pivot_longer(cols = -country, names_to = 'year', values_to = col)
}
#Use `imap` to pass each dataframe as well as its name to the function.
dlist <- imap(dlist, convertlong)
#If you want the changes to be reflected for dataframes in global environment.
list2env(dlist, .GlobalEnv)
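For reference, the same idea can be sketched in base R without extra packages. The two small data frames and their values are made up, and `as.data.frame(as.table(...))` is swapped in for the reshaping step. Note that the function returns its result and the result is assigned back to the list, which the original loop did not do.

```r
# Toy inputs with made-up values; countries are the row names
child_mortality <- data.frame(`2000` = c(5, 7), `2001` = c(4, 6),
                              row.names = c("Austria", "Sweden"),
                              check.names = FALSE)
income_capita <- data.frame(`2000` = c(100, 120), `2001` = c(110, 130),
                            row.names = c("Austria", "Sweden"),
                            check.names = FALSE)

dlist <- list(child_mortality = child_mortality,
              income_capita = income_capita)

convertlong <- function(df, value_name) {
  # One row per (row name, column name) pair
  long <- as.data.frame(as.table(as.matrix(df)), stringsAsFactors = FALSE)
  names(long) <- c("country", "year", value_name)
  long  # return the reshaped data frame
}

# Pass each data frame together with its name and keep the results
dlist <- Map(convertlong, dlist, names(dlist))
print(dlist$child_mortality)
```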

Unnest a ts class

My data contains sales data for multiple customers, each with different start and end dates, so I applied simple exponential smoothing.
I used the following code to apply ses:
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output is a list whose elements have uneven lengths. Can someone help me unnest or unlist the output into a data frame and get the fitted values, actuals and residuals along with their dates, sales and customer_id?
Note: the reason I posted my input data rather than the HW data is that the HW data is too large.
Can someone help me do this in R?

I would use the tidyverse package to handle this problem.
map(HW, ~ .x %>%
as.data.frame %>% # convert each element of the list to data.frame
rownames_to_column) %>% # add row names as columns within each element
bind_rows(.id = "customer_id") # bind all elements and add customer ID
I am not sure how to relate the dates and actual sales to your output (HW). If you explain it, I might be able to provide a solution to that part of the problem too.

First, I took all the unique customer_id values into a variable called 'k':
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted the fitted values for each customer in a for loop, storing them in 'a', and used rbind to append them to data frame 'b':
for(key in k){
print(a <- as.data.frame((as.numeric(HW_ses[[key]]$model$fitted))))
b <- rbind(b,a)
}
Finally, I used cbind to attach data frame 'b' to the input data set:
data_set_final <- cbind(data_set,b)
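The binding step can be sketched without growing a data frame inside a loop. This is only an illustration: `HW_demo` is a hypothetical stand-in list with made-up numbers, and it assumes each element exposes `fitted` and `residuals` components of equal length, which should be checked against the real `ses` output.

```r
# Stand-in for the list of fits; uneven lengths per customer (made-up values)
HW_demo <- list(
  c1 = list(fitted = c(10, 11, 12), residuals = c(0.5, -0.2, 0.1)),
  c2 = list(fitted = c(20, 21), residuals = c(-1, 1))
)

# Build one data frame per customer, tagged with the customer id
rows <- lapply(names(HW_demo), function(id) {
  data.frame(
    customer_id = id,
    fitted = as.numeric(HW_demo[[id]]$fitted),
    residuals = as.numeric(HW_demo[[id]]$residuals)
  )
})

# Stack them into a single long data frame
result <- do.call(rbind, rows)
print(result)
```

Keeping customer_id in every row makes it possible to join dates and sales back on afterwards, instead of relying on row order as cbind does.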

How to replace na in R based on values in two columns

I am trying to replace null values based on two columns. Basically, I have company codes in one column and their respective values in the second. I need to replace each missing value with the mean of the values for that company code rather than the mean of the complete column. How do I do it in R? (Look at the image below)

Assuming your data is in a data frame called 'myData' you can go ahead and use the ddply function from the plyr package to generate the mean per company code. The ddply function applies a function to a column(s) grouped by another column(s).
library(plyr)
#Find the entries where the values are NULL, using "" (empty string) as NULL
#Can replace "" with whatever NULL is for you
nullMatches <- myData$Values == ""
#Generate the mean for each company
#This will return a 2-column data frame; the first column will be "Symbol".
#The second column will be the mean value for each 'Symbol'.
meanPerCompany <- ddply(myData[!nullMatches,], "Symbol", numcolwise(mean))
#Match the company symbol and store the mean
myData$Values[nullMatches] <- meanPerCompany[match(myData$Symbol[nullMatches], meanPerCompany[,1]),2]

Do you need something like this?
library(dplyr) # provides %>%
df <- data.frame(Symbol = c("NXCDX", "ALX", "ALX", "BESOQ", "BESOQ", "BESOQ"),
Values = c(2345, 8654, NA, 6394, 8549, NA))
df %>% dplyr::group_by(Symbol) %>% dplyr::summarise(mean_values = mean(Values, na.rm = TRUE))
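The summary above gives one mean per symbol. To write those means back into the missing entries, one option is base R's `ave`, sketched here on the same example data (`group_mean` is a hypothetical helper name):

```r
df <- data.frame(Symbol = c("NXCDX", "ALX", "ALX", "BESOQ", "BESOQ", "BESOQ"),
                 Values = c(2345, 8654, NA, 6394, 8549, NA))

# Per-row group mean, computed within each Symbol and ignoring NAs
group_mean <- ave(df$Values, df$Symbol,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Keep existing values, fill only the missing ones
df$Values <- ifelse(is.na(df$Values), group_mean, df$Values)
print(df)
```

If every value of a symbol is missing, its mean is NaN and the entries stay missing.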

Using data.table:
library(data.table)
setDT(df)[,replace(Values,is.na(Values),mean(Values,na.rm = T)),by=Symbol]

How to subtract values by comparing columns from two datasets?

I have the following data structure:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Two dataframes.
map: with three variables. Chr, Pos and CM.
Data: with three variables: Chr, Pos.1, Pos.2
Matching Data$Pos.2 and Data$Pos.1 with map$Pos, I need to get the difference of map$CM values between these two matches. This procedure needs to be done by $Chr.
As an example: for the first row of Data (1,30,40) the desired value is 0.1010101 (obtained from 0.39393939 - 0.29292929). For the first row of Data with Chr = 2 (2,4,9) the desired value is 0.06329114 (0.10126582 - 0.03797468).

If I understood correctly what you want, I think you have to do something like this:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Load the tidyverse:
library(tidyverse)
You have to transform your data into dataframes:
Data <- as.data.frame(Data)
map <- as.data.frame(map)
Then you just have to retrieve the CM values with left_join, once for each position column:
Data_CM <- left_join(Data,map,by=c("Chr","Pos.1"="Pos")) %>%
rename(CM.1=CM)
Data_CM <- left_join(Data_CM,map,by=c("Chr","Pos.2"="Pos")) %>%
rename(CM.2=CM)
The Diff variable holds the difference between the two retrieved values:
Data_CM <- Data_CM %>%
mutate(Diff=(CM.2-CM.1))
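The same lookup can be sketched in base R with `match` on a combined chromosome/position key, which also makes the expected values easy to verify. `lookup_cm` is a hypothetical helper, and the data is rebuilt here as data frames so the block is self-contained.

```r
map <- data.frame(
  Chr = c(rep(1, 100), rep(2, 80)),
  Pos = c(1:100, 1:80),
  CM = c(seq(0, 1, length.out = 100), seq(0, 1, length.out = 80))
)
Data <- data.frame(
  Chr = c(rep(1, 5), rep(2, 4)),
  Pos.1 = c(30, 52, 60, 72, 80, 4, 12, 30, 40),
  Pos.2 = c(40, 53, 71, 79, 95, 9, 20, 35, 79)
)

# Look up CM for (chromosome, position) pairs via a pasted key
lookup_cm <- function(chr, pos) {
  map$CM[match(paste(chr, pos), paste(map$Chr, map$Pos))]
}

Data$Diff <- lookup_cm(Data$Chr, Data$Pos.2) - lookup_cm(Data$Chr, Data$Pos.1)
print(Data)
```

For the first row this gives 10/99 (about 0.1010101), and for the first Chr = 2 row it gives 5/79 (about 0.0632911).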