I have a dataframe with a vector of years and several columns which contain the gdp_per_head_values of different countries at a specific point in time. I want to mutate this dataframe to get a variable which contains only the values of the variable of the specific point in time defined by the vector of years.
My data.frame looks like this :
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000),
)
How can I mutate this dataframe as explained above, for example by the variable gpd_head
EDIT : Output should look like
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000)) %>%
mutate(gdp_head =c(.$'1940'[1],.$'1940'[2],.$'1960'[3],
.$'1950'[4],.$'1940'[5],.$'1960'[6],
.$'1960'[7],.$'1950'[8] ))
Here is one approach:
First, since you are going to compare the year_vector column with column names (which will be character), you can convert year_vector to character as well:
dataset$year_vector <- as.character(dataset$year_vector)
You currently have a tibble defined - but if you have it as a plain data.frame you can subset based on a [row, column] matrix and add the matched results as gdp_head:
dataset <- as.data.frame(dataset)
dataset$gdp_head <- as.numeric(dataset[cbind(1:nrow(dataset), match(dataset$year_vector, names(dataset)))])
I came up with the following solution which works aswell :
dataset %>%
do(.,mutate(.,gdp_head = pmap(list(1:nrow(.), year_vector),
function(x,y) .[x,(y-1901+16)]) %>%
unlist() ))
In this solution I just added the position of the first year variable to the column index and subtract that number from the year_vector. In this case the year variables start in the year 1901 which column index corresponds to 16.
Related
I have a data frame with a large number of columns containing numeric values.
I'd like to dynamically calculate the mean value of the two consecutive columns (so mean of column 1 and column 2, mean of column 3 and 4, mean of 5 and 6 etc...) and either store it into new column names or replace one of the two columns I used in the calculation.
I tried creating a function that calculate the mean of two columns and storing it into the first column then applying a loop to that function so it applies to my whole datatable.
However I'm struggling with mutate: since I dynamically generate the name of the column I use (they all start with "PUISSANCE" then a number) through a glue, it displays the value as a string into the mutate and doesn't evaluate it.
mean_col <- function(data, k)
{
n<-2*k+1
m<-2*k+2
varname_even <- paste("PUISSANCE", m,sep="")
varname_odd <- paste("PUISSANCE", n,sep="")
mutate(data, "{{varname_odd}}" := ({{varname_odd}}+{{varname_even}})/2) %% *here is the issue, the argument on the right is considered as non numeric, since it is the sum of two strings...*
data
}
for (k in 0:24) {
my_data_set <- mean_col(my_data_set,k)
}
Ok guys, just to let you know that I managed to solve it myself.
I did a pivot_longer transmute in order to put all the "PuissanceXX" in the same column and the values associated in another.
Then I used str_extract to get only the number XX from the string "PUISSANCEXX" that I converted into a numeric.
Thanks to a division by 2 (-0,5) I managed to have each successive value being X and X,5. so getting both values to X thanks to a floor. Then I just did a group_by/summarize in order to get the sum and that's it !
pivot_longer(starts_with("PUISSANCE"),names_to = "heure", values_to = "puissance") %>%
mutate("time" = floor(as.numeric(str_extract(heure, "\\d+"))/2-0.5)) %>%
select(-heure) %>%
group_by(time) %>%
summarise("power" = mean(puissance))
So, I have 6 data frames, all look like this (with different values):
Now I want to create a new column in all the data frames for the country. Then I want to convert it into a long df. This is how I am going about it.
dlist<- list(child_mortality,fertility,income_capita,life_expectancy,population)
convertlong <- function(trial){
trial$country <- rownames(trial)
trial <- melt(trial)
colnames(trial)<- c("country","year",trial)
}
for(i in dlist){
convertlong(i)
}
After running this I get:
Using country as id variables
Error in names(x) <- value :
'names' attribute [5] must be the same length as the vector [3]
That's all, it doesn't do the operations on the data frames. I am pretty sure I'm taking a stupid mistake, but I looked online on forums and cannot figure it out.
maybe you can replace
trial$country <- rownames(trial)
by
trial <- cbind(trial, rownames(trial))
Here's a tidyverse attempt -
library(tidyverse)
#Put the dataframes in a named list.
dlist<- dplyr::lst(child_mortality, fertility,
income_capita, life_expectancy,population)
#lst is not a typo!!
#Write a function which creates a new column with rowname
#and get's the data in long format
#The column name for 3rd column is passed separately (`col`).
convertlong <- function(trial, col){
trial %>%
rownames_to_column('country') %>%
pivot_longer(cols = -country, names_to = 'year', values_to = col)
}
#Use `imap` to pass dataframe as well as it's name to the function.
dlist <- imap(dlist, convertlong)
#If you want the changes to be reflected for dataframes in global environment.
list2env(dlist, .GlobalEnv)
My data has multiple customers data with different start and end dates along with their sales data.So I did simple exponential smoothing.
I applied the following code to apply ses
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output class is list with uneven lengths.Can someone help me how to unnest or unlist the output in to a data frame and get the fitted values,actuals,residuals along with their dates,sales and customer_id.
Note : the reson I post my input data rather than data of HW is,the HW data is too large.
Can someone help me in R.
I would use tidyverse package to handle this problem.
map(HW, ~ .x %>%
as.data.frame %>% # convert each element of the list to data.frame
rownames_to_column) %>% # add row names as columns within each element
bind_rows(.id = "customer_id") # bind all elements and add customer ID
I am not sure how to relate dates and actual sales to your output (HW). If you explain it I might provide solution to that part of the problem too.
Firstly took all the unique customer_id into a variable called 'k'
k <- unique(data_set$customer_id)
Created a empty data frame
b <- data.frame()
extracted all the fitted values using a for loop and stored in 'a'.Using the rbind function attached all the fitted values to data frame 'b'
for(key in k){
print(a <- as.data.frame((as.numeric(HW_ses[[key]]$model$fitted))))
b <- rbind(b,a)
}
Finally using column bind function attached the input data set with data frame 'b'
data_set_final <- cbind(data_set,b)
I am trying to replace null values based on two columns. Basically, I have company codes in one column and its respective values in the second. I need to replace mean of the values for each of the company code rather than mean of the complete column. How do I do it in R? (Look at the image below)
Assuming your data is in a data frame called 'myData' you can go ahead and use the ddply function from the plyr package to generate the mean per company code. The ddply function applies a function to a column(s) grouped by another column(s).
library(plyr)
#Find the entries where the values are NULL, using "" (empty string) as NULL
#Can replace "" with whatever NULL is for you
nullMatches <- myData$Values == ""
#Generate the mean for each company
#This will return a 2 column data frame, first column will be "Symbol".
#Second column will the value of means for each 'Symbol'.
meanPerCompany <- ddply(myData[!nullMatches,], "Symbol", numcolwise(mean))
#Match the company symbol and store the mean
myData$Values[nullMatches] <- meanPerCompany[match(myData$Symbol[nullMatches], meanPerCompany[,1]),2]
Do you need something like this:
df <- data.frame(Symbol = c("NXCDX", "ALX", "ALX", "BESOQ", "BESOQ", "BESOQ"),
Values = c(2345, 8654, NA, 6394, 8549, NA))
df %>% dplyr::group_by(Symbol) %>% dplyr::summarise(mean_values = mean(Values, na.rm = TRUE))
using data.table
library(data.table)
setDT(df)[,replace(Values,is.na(Values),mean(Values,na.rm = T)),by=Symbol]
I have the following data structure:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Two dataframes.
map: with three variables. Chr, Pos and CM.
Data: with three variables: Chr, Pos.1, Pos.2
Matching Data$Pos.2 and Data$Pos.1 with map$Pos, I need to get the difference of map$CM values between these two matches. This procedure needs to be done by $Chr.
As an example: For the first row of Data (1,30,40) the desirable value would be 0.1010101 (this is obtained by the operation 0.39393939 – 0.29292929). for the first row of Data with Chr = 2 (2,4,9) the desirable value would be 0.06468352 (0.1026582-0.03797468).
Whether I well understood what you desire, I think you have to do something like this:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Using library tidyverse
library(tidyverse)
You have to tranform your data into dataframes:
Data <- as.data.frame(Data)
map <- as.data.frame(map)
Then you have just to retrieve information using left_join
Data_CM <- left_join(Data,map,by=c("Chr","Pos.1"="Pos")) %>%
rename(CM.1=CM)
Data_CM <- left_join(Data_CM,map,by=c("Chr","Pos.2"="Pos")) %>%
rename(CM.2=CM)
The Diff variable will compute the difference between two retrieved values
Data_CM <- Data_CM %>%
mutate(Diff=(CM.2-CM.1))