Removing "outer rows" to allow for interpolation (and prevent extrapolation) - r

I have (left)joined two data frames by country-year.
df<- left_join(df, df2, by="country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all rows for which both of the added columns (a, b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, on second thought, I decided I wanted to interpolate the data I had (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are 2 options. First I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df<-df[!is.na( df$a | df$b ),]
or I write code to remove the "outer" rows first and then use extend as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 1000 1000
8 UK UK2002 1000 1000
Any suggestions?

na.fill has options that control what gets filled. If you look at ?na.fill, you see that fill can have three components: the fill for the left of the data, for the interior, and for the right. If you specify NA for the left and right and "extend" for the interior, only the interior gaps are filled (by linear interpolation). You can then filter out the rows that are still NA.
library(tidyverse)
library(zoo)
df %>%
group_by(country) %>%
mutate_at(vars(a:b),~na.fill(.x,c(NA, "extend", NA))) %>%
filter(!is.na(a) | !is.na(b))
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.
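As a cross-check, here is a minimal sketch of the same idea using zoo's na.approx with na.rm = FALSE, which interpolates interior gaps and leaves leading and trailing NAs in place. The data frame is rebuilt from the question's example; the column name country_year (underscore instead of hyphen) and the use of across() from dplyr 1.0 or later are my assumptions, and the sketch assumes every country has at least two observed values.

```r
library(dplyr)
library(zoo)

# Example data rebuilt from the question (country_year is an assumed name)
df <- data.frame(
  country = rep(c("France", "UK"), each = 5),
  country_year = paste0(rep(c("France", "UK"), each = 5), 2000:2004),
  a = c(NA, 1000, NA, 1600, NA, 1000, NA, 1000, NA, NA),
  b = c(NA, 1000, NA, 2200, NA, 1000, NA, 1000, NA, NA)
)

interpolated <- df %>%
  group_by(country) %>%
  # na.rm = FALSE keeps leading/trailing NAs, so nothing is extrapolated
  mutate(across(c(a, b), ~ na.approx(.x, na.rm = FALSE))) %>%
  ungroup() %>%
  # drop the "outer" rows that could not be interpolated
  filter(!is.na(a) | !is.na(b))

interpolated
```

With this input, France2002 is filled with 1300 and 1600, and UK2001 with 1000 and 1000, since interpolating between two equal values returns that value.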

Related

Impute only certain NA's for a variable in a data frame [closed]

I'm new to R and exploring the different options it offers. I'm working on a data frame where I have a variable with 900 missing values, i.e. NAs.
I want to impute 3 different values for NAs;
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
If we check months 8, 10 and 12, there is not much difference between their counts. Hence I thought of assigning these 3 months to the NAs by splitting them in the ratio 300:300:284. Usually we would go by the mode, but I want to try this approach.
I assume you mean you have a long list, some of the values of which are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
We can keep a running tally of NA's and assign those to the imputed value using integer division with %/%.
library(tidyverse)
df2 <- df %>%
mutate(NA_num = if_else(is.na(val),
cumsum(is.na(val)),
NA_integer_),
imputed = NA_num %/% 100 + 1)
Output:
df2 %>%
slice(397:410) # based on manual examination using this seed
val NA_num imputed
1 NA 98 1
2 NA 99 1
3 3 NA NA
4 1 NA NA
5 1 NA NA
6 3 NA NA
7 3 NA NA
8 2 NA NA
9 NA 100 2
10 1 NA NA
11 NA 101 2
12 2 NA NA
13 1 NA NA
14 2 NA NA
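Note that with NA_num %/% 100 + 1 the 100th NA already lands in group 2 (as the output above shows), so the first group only gets 99 values. If exact blocks matter, subtracting 1 before the integer division fixes that. A minimal base R sketch, where block_size is a hypothetical name and 100 is just the block size used above:

```r
set.seed(42)
val <- sample(c(1:3, NA_real_), size = 1000, replace = TRUE)

block_size <- 100                  # hypothetical: number of NAs per imputed value
na_rank <- cumsum(is.na(val))      # running count of NAs seen so far

# (rank - 1) %/% size maps ranks 1..100 to 0, 101..200 to 1, and so on
imputed <- ifelse(is.na(val), (na_rank - 1) %/% block_size + 1, val)

# the first block_size NAs all receive 1
table(imputed[is.na(val)][1:block_size])
```

For the asker's case, block_size would be 300 and the resulting 1, 2, 3 could be mapped to months 8, 10 and 12.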
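Without an example I can't test it, but I think this will work.
Basically, filter the NA rows into a new table, do the calculation there and bind it back. Assume new_dt is the original data filtered so that it only contains the NA rows, and dt holds the remaining rows.
library(tidyverse)
# stand-in for the 900 rows whose x2 is NA
new_dt <- data.frame(x1 = 1:900, x2 = NA) %>% filter(is.na(x2)) %>%
# subtract 1 so that rows 1-300 get 1, 301-600 get 2, 601-900 get 3
mutate(x2 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
(row_number() - 1) %/% 300 == 1 ~ 2,
(row_number() - 1) %/% 300 == 2 ~ 3))
# append the imputed rows to the rows that were never NA
dt <- rbind(dt, new_dt)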

populating data based on column/rownames with unequal row number [duplicate]

I need to populate an empty data frame with values, matched on the values in the first column (or alternatively on row names; either is fine for me in this case). So here are three objects:
set.seed(11) # set.seed=11 would only create a variable called set.seed
empty_df=data.frame(cities=c("New York","London","Rome","Vienna","Amsterdam"),
col.a=rep(NA,5),
col.b=rep(NA,5),
col.c=rep(NA,5))
values=rnorm(4,0,1)
to_fill=data.frame(cities=c("New York","London","Vienna","Amsterdam"),
col.a=values)
desired_output=data.frame(cities=c("New York","London","Rome","Vienna","Amsterdam"),
col.a=c(values[1],values[2],NA,values[3],values[4]),
col.b=rep(NA,5),
col.c=rep(NA,5))
The first column (it can be converted to row names; a solution using either row names or a first column of city names is fine) contains some cities I'd like to visit, and the other columns hold some unspecified values. The first object is the df I want to fill with values, and it prints as:
cities col.a col.b col.c
1 New York NA NA NA
2 London NA NA NA
3 Rome NA NA NA
4 Vienna NA NA NA
5 Amsterdam NA NA NA
The second is the object I want to put INTO the empty df, and as you can see it is missing one row (the one with "Rome"):
cities col.a
1 New York 0.55213218
2 London 0.98907729
3 Vienna 1.11703741
4 Amsterdam -0.04616725
So now I want to put this inside the empty df, leaving NA in the row which does not match:
cities col.a col.b col.c
1 New York -0.62731870 NA NA
2 London -1.80206612 NA NA
3 Rome NA NA NA
4 Vienna -1.73446286 NA NA
5 Amsterdam -0.05709419 NA NA
I tried the simplest merge solution, merge(empty_df,to_fill, by="cities"), which gives:
cities col.a.x col.b col.c col.a.y
1 Amsterdam NA NA NA -0.05709419
2 London NA NA NA -1.80206612
3 New York NA NA NA -0.62731870
4 Vienna NA NA NA -1.73446286
And when I tried desired_output$col.a=merge(empty_df,to_fill, by="cities"), an error occurred (replacement has 4 rows, data has 5). Is there a simple solution for this that can be put in a for loop or apply?
We can use match:
empty_df$col.a <- to_fill$col.a[match(empty_df$cities, to_fill$cities)]
empty_df;
# cities col.a col.b col.c
#1 New York 1.5567564 NA NA
#2 London -0.6969401 NA NA
#3 Rome NA NA NA
#4 Vienna 1.3336636 NA NA
#5 Amsterdam 0.7329989 NA NA
We fill col.a of empty_df with col.a values from to_fill by matching cities from empty_df with cities from to_fill.
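To make the mechanics visible, here is a small sketch with made-up numbers (10, 20, 30, 40 are placeholders, not the question's random values). match returns, for each city in empty_df, its position in to_fill, or NA when there is no match; indexing with NA then yields NA.

```r
empty_df <- data.frame(
  cities = c("New York", "London", "Rome", "Vienna", "Amsterdam"),
  col.a = NA_real_
)
to_fill <- data.frame(
  cities = c("New York", "London", "Vienna", "Amsterdam"),
  col.a = c(10, 20, 30, 40)   # placeholder values for illustration
)

# position of each empty_df city within to_fill; Rome has no match
idx <- match(empty_df$cities, to_fill$cities)
idx
# [1]  1  2 NA  3  4

# indexing with NA produces NA, so Rome stays empty
empty_df$col.a <- to_fill$col.a[idx]
empty_df
```

Because idx only depends on the city columns, it can be computed once and reused for several columns inside a loop or lapply.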

Replicating a Character String Up to a certain point in R Dataframe

I currently have the following dataframe below:
Country Information Export Import
Andorra Small 10 20
Medium 50 30
Large 40 50
Total NA 100 100
Antigua Small 60 70
Medium 20 10
Large 5 10
X-Large 15 10
Total NA 100 100
I would like to repeat the Country name up until it reaches the character string "Total", so I would have Andorra repeated in the rows of the $Country column up until the row "Total".
As you can see, the number of rows differs for nearly every country (I have 252 of them), so I need a way to ensure that the country name is repeated for that specific country up until it reaches "Total"
(e.g. Antigua has 4 rows, not 3 like Andorra, so Antigua would need to be repeated 4 times in the $Country column).
Is there a quick and efficient way to do this?
Any help is appreciated.
Thank you
I'm assuming you have NA values, not blank strings, in the rows where the country is missing.
You can use the na.locf function from the zoo package and apply it to your country column, like this:
library(zoo)
# example of column values
country = c("Andorra",NA,NA,"Total","Antigua",NA,NA,NA,"Total")
# apply function and update your variable
country = na.locf(country)
# see updated values
country
# [1] "Andorra" "Andorra" "Andorra" "Total" "Antigua" "Antigua" "Antigua" "Antigua" "Total"
It replaces each NA value with the most recent non-NA value before it ("last observation carried forward").
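If you want to see what is going on without a package, the same carry-forward can be sketched in base R. This is only an illustration of the idea, not how na.locf is implemented, and it assumes the first element is not NA.

```r
country <- c("Andorra", NA, NA, "Total", "Antigua", NA, NA, NA, "Total")

# running count of non-NA values: each position gets the index of the
# last non-NA value seen so far (assumes country[1] is not NA)
last_seen <- cumsum(!is.na(country))

# look up that last non-NA value for every position
filled <- country[!is.na(country)][last_seen]
filled
# [1] "Andorra" "Andorra" "Andorra" "Total" "Antigua" "Antigua" "Antigua" "Antigua" "Total"
```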
I would use the fill function from the tidyr package
Input Data
df <- data.table::fread("Country Information Export Import
Andorra Small 10 20
NA Medium 50 30
NA Large 40 50
Total NA 100 100
Antigua Small 60 70
NA Medium 20 10
NA Large 5 10
NA X-Large 15 10
Total NA 100 100")
Code to Fill in missing information using fill from tidyr
library(tidyr)
fill(df, Country, .direction = "down")
Output
Country Information Export Import
1: Andorra Small 10 20
2: Andorra Medium 50 30
3: Andorra Large 40 50
4: Total <NA> 100 100
5: Antigua Small 60 70
6: Antigua Medium 20 10
7: Antigua Large 5 10
8: Antigua X-Large 15 10
9: Total <NA> 100 100
If there are zero-length strings instead of NA values, you can use the na_if function from the dplyr package to change them to NA first:
library(dplyr)
df %>%
mutate(Country = na_if(Country,"")) %>%
fill(Country, .direction = "down")

How can I overcome this error Error in tbl_vars(y) : argument "y" is missing, with no default?

I am trying to perform an inner join on 2 tables.
One is a hotel dataset which I have tokenized before using
df1 = read.csv("chennai.csv", header = TRUE, stringsAsFactors=FALSE)
library(dplyr)
library(tidytext)
hotel <- df1 %>% unnest_tokens(word,Review_Text)
data("stop_words")
hotel <- hotel %>%
anti_join(stop_words)
head(hotel)
Hotel_name Review_Title Sentiment
1 Accord Metropolitan Excellent comfortableness during stay 3
2 Accord Metropolitan Excellent comfortableness during stay 3
3 Accord Metropolitan Excellent comfortableness during stay 3
4 Accord Metropolitan Excellent comfortableness during stay 3
5 Accord Metropolitan Excellent comfortableness during stay 3
6 Accord Metropolitan Not too comfortable 1
Rating_Percentage X X.1 X.2 X.3 word
1 100 NA NA NA nice
2 100 NA NA NA stay
3 100 NA NA NA business
4 100 NA NA NA tourist
5 100 NA NA NA purpose
6 20 NA NA NA hotel
I have also used a simplified version of General Inquirer Dictionary spreadsheet
df <- read.csv("ib.csv", header=T, stringsAsFactors=FALSE)
dat <-subset(df, select=c(2,1))
head(dat)
word Scoree
1 A
2 ABANDON Negativ
3 ABANDONMENT Negativ
4 ABATE Negativ
5 ABATEMENT
6 ABDICATE Negativ
I have tried to do an inner_join where I encounter this error.
observation<- hotel %>%
+ inner_join(dat, by = "word") %>%
+ count(Scoree)
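One likely cause, assuming the leading + characters were copied from the console's continuation prompt rather than typed on purpose: R reads them as unary plus, so the right-hand side of the pipe becomes +inner_join(dat, by = "word"), and inner_join is probably then called with dat as its first argument and no y. A minimal sketch without the + signs, on toy data (the words and the "Positiv" score are made up for illustration):

```r
library(dplyr)

# toy stand-ins for the tokenized reviews and the dictionary
hotel <- data.frame(word = c("nice", "stay", "abandon", "abate"),
                    stringsAsFactors = FALSE)
dat <- data.frame(word = c("ABANDON", "ABATE", "NICE"),
                  Scoree = c("Negativ", "Negativ", "Positiv"),
                  stringsAsFactors = FALSE)

observation <- hotel %>%
  # the dictionary is upper case while the tokens are lower case,
  # so lower-case the dictionary words before joining
  inner_join(mutate(dat, word = tolower(word)), by = "word") %>%
  count(Scoree)

observation
```

The tolower step is a second assumption: the head(dat) output shows upper-case words, while unnest_tokens lower-cases tokens by default, so a join on the raw columns would likely return no rows.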

R getting rid of nested for loops

I did quite some searching on how to simplify the code for the problem below but was not successful. I assume that with some kind of apply-magic one could speed things up a little, but so far I still have my difficulties with these kind of functions ....
I have a data.frame data, structured as follows:
year iso3c gdpppc elec solid liquid heat
2010 USA 1567 1063 1118 835 616
2015 USA 1571 NA NA NA NA
2020 USA 1579 NA NA NA NA
... USA ... NA NA NA NA
2100 USA 3568 NA NA NA NA
2010 ARG 256 145 91 85 37
2015 ARG 261 NA NA NA NA
2020 ARG 270 NA NA NA NA
... ARG ... NA NA NA NA
2100 ARG 632 NA NA NA NA
As you can see, I have a historical starting value for 2010 and a complete scenario for gdpppc up to 2100. I want to let the values for elec, solid, liquid and heat grow according to some elasticity with respect to the development of gdpppc, but separately for each country (coded in iso3c).
I have the elasticities defined in a separate data.frame parameters:
item value
elec 0.5
liquid 0.2
solid -0.1
heat 0.1
So far I am using a nested for loop:
for (e in 1:length(levels(parameters$item))) {
  item <- as.character(parameters[e, "item"])  # column name as a string
  for (c in 1:length(levels(data$iso3c))) {
    # subset: one country, one variable of interest
    tmp <- subset(data, select=c("year", "iso3c", "gdpppc", item), subset=(iso3c == levels(data$iso3c)[c]))
    # grow the 2010 value by the cumulated, elasticity-scaled gdpppc growth
    tmp[tmp$year %in% seq(2015, 2100, 5), item] <-
      tmp[tmp$year == 2010, item] *
      cumprod((1 + (tmp[tmp$year %in% seq(2015, 2100, 5), "gdpppc"] /
      tmp[tmp$year %in% seq(2010, 2095, 5), "gdpppc"] - 1) * parameters[e, "value"]))
    # write the projected values back into data
    data[data$iso3c == levels(data$iso3c)[c] & data$year %in% seq(2015, 2100, 5), item] <- tmp[tmp$year > 2010, item]
  }
}
The outer loop runs over the columns and the inner one over the countries (I have 180+ countries). First, a subset containing the data for one single country and the variable of interest is selected. Then I let the respective variable grow with a certain elasticity to the growth in gdpppc, and finally I put the subset back into place in data.
I have already tried to let the outer loop run in parallel using foreach, but was not successful in recombining the results. Since I have to run similar calculations quite often, I would be very grateful for any help.
Thanks
Here's one way. Note I renamed your parameters data.frame to p
library(data.table)
library(reshape2)
dt <- data.table(data)
dt.melt = melt(dt,id=1:3)
dt.melt[,value:=as.numeric(value)] # coerce value column to numeric
dt.melt[,value:=head(value,1)+(gdpppc-head(gdpppc,1))*p[p$item==variable,]$value,
by="iso3c,variable"]
result <- dcast(dt.melt,iso3c+year+gdpppc~variable)
result
# iso3c year gdpppc elec solid liquid heat
# 1 ARG 2010 256 145.0 91.0 85.0 37.0
# 2 ARG 2015 261 147.5 90.5 86.0 37.5
# 3 ARG 2020 270 152.0 89.6 87.8 38.4
# 4 ARG 2100 632 333.0 53.4 160.2 74.6
# 5 USA 2010 1567 1063.0 1118.0 835.0 616.0
# 6 USA 2015 1571 1065.0 1117.6 835.8 616.4
# 7 USA 2020 1579 1069.0 1116.8 837.4 617.2
# 8 USA 2100 3568 2063.5 917.9 1235.2 816.1
The basic idea is to use the melt(...) function to reshape your original data into "long" format, where the values in the four columns solid, liquid, elec, and heat are all in one column, value, and the column variable indicates which metric value refers to. Now, using data tables, you can fill in the values easily. Then, reshape the result back into wide format using dcast(...).
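The formula above grows each value linearly with the change in gdpppc. If you want the compounding growth-factor formula from the question's loop instead, a grouped version could look like the sketch below. This is an alternative written with dplyr (1.0 or later, for across and cur_column) rather than data.table; the project helper, the elasticity vector and the three-year toy data are my own illustrative names and values, not part of the original code.

```r
library(dplyr)

# small stand-in for the question's data (only two of the four columns)
data <- data.frame(
  year   = rep(c(2010, 2015, 2020), times = 2),
  iso3c  = rep(c("USA", "ARG"), each = 3),
  gdpppc = c(1567, 1571, 1579, 256, 261, 270),
  elec   = c(1063, NA, NA, 145, NA, NA),
  solid  = c(1118, NA, NA, 91, NA, NA)
)
elasticity <- c(elec = 0.5, solid = -0.1)  # named lookup, from the parameters table

# hypothetical helper: start value times the cumulated growth factors,
# where each factor is 1 + (gdp_t / gdp_{t-1} - 1) * elasticity
project <- function(start, gdp, eps) {
  step <- c(1, 1 + (gdp[-1] / gdp[-length(gdp)] - 1) * eps)
  start * cumprod(step)
}

result <- data %>%
  arrange(iso3c, year) %>%          # years must be in order within a country
  group_by(iso3c) %>%
  mutate(across(all_of(names(elasticity)),
                ~ project(first(.x), gdpppc, elasticity[[cur_column()]]))) %>%
  ungroup()

result
```

For ARG, elec in 2015 becomes 145 * (1 + (261/256 - 1) * 0.5), roughly 146.42, instead of the 147.5 from the linear version.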
