Combining rows of data into one with an uncommon aspect in R

I have a data frame that looks something like the following.
Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA
and so on.
The data frame repeats this pattern for different titles: for each one, a value appears in 'Number' in 2004 and in 'Allocation' in 2001.
How would I go about reshaping the data so that each title collapses into a single row, like this?
Code Title Number Allocation
1000 Jack 113 6
1001 Dave 101 19

This also works:
library(dplyr)
df %>%
select(-Year) %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct()
or:
df %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct(Code, Title, Number, Allocation)
Result:
# A tibble: 2 x 4
# Groups: Code, Title [2]
Code Title Number Allocation
<int> <fctr> <int> <int>
1 1000 Jack 113 6
2 1001 Dave 101 19
Data:
df = read.table(text=" Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)
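As a side note, funs() has been deprecated since dplyr 0.8.0, and the sort() trick above relies on sort() dropping NAs and on each group holding exactly one non-missing value per column. A minimal sketch of the same idea with summarise() and across(), assuming dplyr >= 1.0; the helper first_non_na is a hypothetical name introduced here, not part of dplyr:

```r
library(dplyr)  # assumes dplyr >= 1.0, which provides across()

df <- read.table(text = "Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)

# Hypothetical helper: first non-missing value, or NA if the group has none.
first_non_na <- function(x) x[!is.na(x)][1]

# One row per Code/Title; Year is dropped because it is not summarised.
collapsed <- df %>%
  group_by(Code, Title) %>%
  summarise(across(c(Number, Allocation), first_non_na), .groups = "drop")

print(collapsed)
```

If a group could hold several non-missing values in one column, this keeps only the first, so you would need to decide which one you actually want.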

Related

Add multiple columns lagged by one year

I need to add a 1-year-lagged version of multiple columns from my dataframe. Here's my data:
data<-data.frame(Year=c("2011","2011","2011","2012","2012","2012","2013","2013","2013"),
Country=c("America","China","India","America","China","India","America","China","India"),
Value1=c(234,443,754,334,117,112,987,903,476),
Value2=c(2,4,5,6,7,8,1,2,2))
And I want to add two columns that contain Value1 and Value2 at t-1, so that my dataframe looks like this:
How can I do this? Would this be the correct way to lag my variables by year?
Thanks in advance!
Using data.table:
library(data.table)
setDT(data)
cols <- grep("^Value", colnames(data), value = TRUE)
data[, paste0(cols, "_lag") := lapply(.SD, shift), .SDcols = cols, by = Country]
# Year Country Value1 Value2 Value1_lag Value2_lag
# 1: 2011 America 234 2 NA NA
# 2: 2011 China 443 4 NA NA
# 3: 2011 India 754 5 NA NA
# 4: 2012 America 334 6 234 2
# 5: 2012 China 117 7 443 4
# 6: 2012 India 112 8 754 5
# 7: 2013 America 987 1 334 6
# 8: 2013 China 903 2 117 7
# 9: 2013 India 476 2 112 8
In dplyr, use lag by group:
library(dplyr) #1.1.0
data %>%
mutate(across(contains("Value"), lag, .names = "{col}_lagged"), .by = Country)
Year Country Value1 Value2 Value1_lagged Value2_lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
For dplyr versions below 1.1.0:
data %>%
group_by(Country) %>%
mutate(across(contains("Value"), lag, .names = "{col}_lagged")) %>%
ungroup()
Another way using dplyr to get the job done.
library(dplyr)
data_lagged <- data %>%
group_by(Country) %>%
mutate(Value1_Lagged = lag(Value1),
Value2_Lagged = lag(Value2),
Year = as.integer(as.character(Year)) + 1)
data_final <- cbind(data, data_lagged[, c("Value1_Lagged", "Value2_Lagged")])
data_final
Output:
Year Country Value1 Value2 Value1_Lagged Value2_Lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
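Note that shift() and lag() take the previous row within each country, so they assume the rows are sorted by year and that no year is missing. If that is not guaranteed, a join on Year + 1 may be safer. Here is a rough base R sketch of that idea; the object names prev and lagged are made up for illustration:

```r
data <- data.frame(
  Year = c("2011", "2011", "2011", "2012", "2012", "2012", "2013", "2013", "2013"),
  Country = rep(c("America", "China", "India"), times = 3),
  Value1 = c(234, 443, 754, 334, 117, 112, 987, 903, 476),
  Value2 = c(2, 4, 5, 6, 7, 8, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Year is stored as text in the question, so convert it before doing arithmetic.
data$Year <- as.integer(data$Year)

# Copy of the data shifted forward one year: the values for year t
# are relabelled as belonging to year t + 1.
prev <- data
prev$Year <- prev$Year + 1L
names(prev)[names(prev) %in% c("Value1", "Value2")] <- c("Value1_lag", "Value2_lag")

# Left join: each row picks up the values of the same country one year earlier.
lagged <- merge(data, prev, by = c("Year", "Country"), all.x = TRUE)
lagged <- lagged[order(lagged$Year, lagged$Country), ]

print(lagged)
```

With this approach a country that skips a year gets NA for the lag instead of silently picking up a value from two years earlier.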

Fill in Column Based on Other Rows (R)

I am looking for a way to fill in a column in R based on values in a different column. Below is what my data looks like.
year action player end
2001 1 Mike 2003
2002 0 Mike NA
2003 0 Mike NA
2004 0 Mike NA
2001 0 Alan NA
2002 0 Alan NA
2003 1 Alan 2004
2004 0 Alan NA
I would like to either change the "action" column or create a new column such that it reflects the duration between the "year" and "end" variables. Below is what it would look like:
year action player end
2001 1 Mike 2003
2002 1 Mike NA
2003 1 Mike NA
2004 0 Mike NA
2001 0 Alan NA
2002 0 Alan NA
2003 1 Alan 2004
2004 1 Alan NA
I have tried to do this with the following loop:
i <- 0
z <- 0
for (i in 1:nrow(df)){
i <- z + i + 1
if (df[i, 2] == 0) {}
else {df[i, 5] = (df[i, 4] - df[i, 1])}
z <- df[i,5]
for (z in i:nrow(df)){df[i, 2] = 1}
}
Here, my i value is skyrocketing, breaking the loop. I am not sure why that is occurring. I'd be interested to either know how to fix my approach or how to do this in a smarter fashion.
There's no need for explicit loops here.
First group your data frame by player. Then find the rows where the cumulative sum (cumsum) of action is greater than 0 and the year is less than or equal to the end year of the group. If the row meets these conditions, set action to 1, otherwise to 0.
Using the dplyr package you could achieve this in a couple of lines:
library(dplyr)
df %>%
group_by(player) %>%
mutate(action = as.numeric(cumsum(action) > 0 & year <= na.omit(end)[1]))
#> # A tibble: 8 x 4
#> # Groups: player [2]
#> year action player end
#> <int> <dbl> <chr> <int>
#> 1 2001 1 Mike 2003
#> 2 2002 1 Mike NA
#> 3 2003 1 Mike NA
#> 4 2004 0 Mike NA
#> 5 2001 0 Alan NA
#> 6 2002 0 Alan NA
#> 7 2003 1 Alan 2004
#> 8 2004 1 Alan NA
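If you would rather keep a loop, one likely reason the original misbehaves is that it reassigns i inside the body (i <- z + i + 1), so the row being read is no longer the row the for loop is on. A minimal sketch of a loop that avoids this by iterating only over the rows that have an end year; the data frame is rebuilt from the question's table, and in_span is a name introduced here for illustration:

```r
df <- data.frame(
  year   = rep(2001:2004, times = 2),
  action = c(1, 0, 0, 0, 0, 0, 1, 0),
  player = rep(c("Mike", "Alan"), each = 4),
  end    = c(2003, NA, NA, NA, NA, NA, 2004, NA),
  stringsAsFactors = FALSE
)

# Visit only the rows where a spell starts (those with a non-missing end year)
# and mark every row of the same player between the start year and the end year.
for (i in which(!is.na(df$end))) {
  in_span <- df$player == df$player[i] &
    df$year >= df$year[i] &
    df$year <= df$end[i]
  df$action[in_span] <- 1
}

print(df)
```

The loop variable is never modified inside the body, and the number of iterations equals the number of spells rather than the number of rows.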

Interpolating missing data in a dataframe with R

I have a dataframe which is similar to the one below:
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 NA NA
3 France FR 2002 NA NA
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 NA NA
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
I have previously used the following code to get the differences:
df <- df %>%
arrange(country, year) %>% #sort data
group_by(country) %>%
mutate_if(is.numeric, funs(d = . - lag(.)))
I would like expand on this code by calculating the difference between the data points of Happiness and Power, divide it by the difference in years between the data points and calculate the values to replace the NA's with, resulting in the following output.
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2002 1400 1800
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 0 0
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
What would be an efficient way of carrying out this task?
EDIT: Please note that also France 2004 is NA. The extend function does seem to properly deal with such a situation.
EDIT 2: Adding group_by(country) seems to break things for reasons I don't understand. It looks as if the code is trying to convert a character to a numeric, although I do not really understand why. When I convert the column to character, the error becomes an evaluation error. Any suggestions?
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(TRcomplete$F116.s)
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(as.character(TRcomplete$F116.s))
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s))
Error: unexpected ')' in "TRcomplete$F116.s <- as.character(TRcomplete$F116.s))"
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s)
> str(TRcomplete$F116.s)
chr [1:6984] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.
You can use na.fill with fill="extend" from the zoo library
rapply(df, zoo::na.fill,"integer",fill="extend",how="replace")
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
EDIT:
library(tidyverse)
library(zoo)
df%>%
group_by(Country)%>%
mutate_at(4:5,~na.fill(.x,"extend"))
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
If all the elements in the group are NA then:
df%>%
group_by(Country)%>%
mutate_if(is.numeric,~if(all(is.na(.x))) NA else na.fill(.x,"extend"))
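Keep in mind that fill = "extend" also fills leading and trailing NAs, whereas the question wants France 2004 to stay NA. A base R sketch that interpolates only between known points, using approx() from the stats package and skipping groups with fewer than two known values (which is what triggers the "need at least two non-NA values" error). The helper interp_by_year is a hypothetical name, and the data is retyped from the question without the Ccode column:

```r
df <- data.frame(
  Country   = rep(c("France", "UK"), each = 5),
  Year      = rep(2000:2004, times = 2),
  Happiness = c(1000, NA, NA, 1600, NA, 1000, NA, 1000, 1000, 1000),
  Power     = c(1000, NA, NA, 2200, NA, 1000, NA, 1000, 1000, 1000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: linear interpolation against the year.
# approx() with its default rule = 1 returns NA outside the known range,
# so leading and trailing gaps are left as NA.
interp_by_year <- function(value, year) {
  ok <- !is.na(value)
  if (sum(ok) < 2) return(value)  # nothing to interpolate between
  approx(x = year[ok], y = value[ok], xout = year)$y
}

# Apply the helper separately to each country and each value column.
for (col in c("Happiness", "Power")) {
  for (g in unique(df$Country)) {
    rows <- df$Country == g
    df[rows, col] <- interp_by_year(df[rows, col], df$Year[rows])
  }
}

print(df)
```

Because the interpolation is done against Year rather than row position, it should still behave sensibly if some years are missing from the data.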

How to order the rows information of a data set with two criteria

I have a data set containing information about academic degrees per year, like this:
Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA
I want to obtain a data frame that contains the year and the highest academic degree obtained just before 2015, like this:
YearX Highest_Degree
2004 Master
2010 PHD
2006 Master
NA NA
2004 Master
2014 Master
Ugh, what a terrible data format. We add an ID column, clean it up, and then we can get what you want in a few lines.
library(tidyr)
library(dplyr)
library(stringr)
# create ID column
dd <- mutate(dd, id = 1:n())
# convert degree and year columns to long format
gather(dd, key = "degkey", value = "degree", starts_with("Deg")) %>%
gather(key = "yearkey", value = "year", starts_with("Year")) %>%
# pull the numbers into an index
mutate(yr_index = str_extract(yearkey, "[0-9]+"),
deg_index = str_extract(degkey, "[0-9]+")) %>%
# get rid of junk and filter to the years you want
filter(yr_index == deg_index, year < 2015) %>%
# order by descending index
arrange(desc(yr_index)) %>%
# keep relevant columns
select(id, degree, year) %>%
# for each ID, keep the top row
group_by(id) %>%
slice(1) %>%
# join back to the original to complete any lost IDs
right_join(select(dd, id))
# Joining, by = "id"
# # A tibble: 6 x 3
# # Groups: id [?]
# id degree year
# <int> <chr> <int>
# 1 1 Master 2004
# 2 2 PHD 2010
# 3 3 College 2006
# 4 4 <NA> NA
# 5 5 Master 2004
# 6 6 Master 2014
# Warning message:
# attributes are not identical across measure variables; they will be dropped
Using this data:
dd = read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = T)
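The pipeline above keeps the last degree listed before 2015, which is not always the highest one (row 3 comes out as College rather than Master). A rough base R sketch that ranks the degrees explicitly; the ordering College < Master < PHD and the names rank_of and highest_before are assumptions made here for illustration:

```r
dd <- read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)

# Assumed ranking of the degrees, lowest to highest.
rank_of <- c(College = 1, Master = 2, PHD = 3)

# Hypothetical helper: highest degree before the cutoff; ties go to the latest year.
highest_before <- function(years, degrees, cutoff = 2015) {
  keep <- !is.na(years) & years < cutoff
  if (!any(keep)) {
    return(data.frame(YearX = NA_integer_, Highest_Degree = NA_character_,
                      stringsAsFactors = FALSE))
  }
  years <- years[keep]
  degrees <- degrees[keep]
  best <- order(unname(rank_of[degrees]), years, decreasing = TRUE)[1]
  data.frame(YearX = years[best], Highest_Degree = degrees[best],
             stringsAsFactors = FALSE)
}

year_cols <- paste0("Year", 1:5)
deg_cols <- paste0("Deg_Year", 1:5)

# Process each row of dd separately and stack the one-row results.
res <- do.call(rbind, lapply(seq_len(nrow(dd)), function(i) {
  highest_before(as.integer(unlist(dd[i, year_cols])),
                 as.character(unlist(dd[i, deg_cols])))
}))

print(res)
```

On the sample data this reproduces the table requested in the question, including the NA row for the person whose only degree dates from 2016.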

R issues with merge/rbind/concatenate two data frames

I am a beginner with R, so I apologise in advance if this question has been asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records based on "customer_no" and to keep only the customer numbers that appear in df2. Both data frames have multiple rows for each customer_no.
I tried the following:
merged.df <- merge(df1, df2, by="customer_no", all.y=TRUE)
The problem is that this assigns values of df1 to df2 where instead it should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess if I resolve the above question this should be easy to see by the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you be still kind enough to rephrase it in English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
10 UK 2001
10 UK 2002
10 UK 2003
20 US 2007
30 AU 2006
df2:
customer_no income
10 700
10 800
10 900
30 1000
Merged file should look like this:
merged.df:
customer_no income country year
10 UK 2001
10 UK 2002
10 UK 2003
10 700
10 800
10 900
30 AU 2006
30 1000
So:
It puts the columns all together, it adds the values of df2 right after the last one of df1 based on same customer_no and matches only customer_no from df2 (merged.df does not have customer_no 20). Also, it leaves empty all the other cells.
In STATA I use append but not sure in R...perhaps join?
THANKS!!
Try:
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (Using the original datasets)
library(data.table)
indx <- df1$customer_no %in% df2$customer_no
rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000
You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
Try:
df1$income = df2$country = df2$year = NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000
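For the second part of the question (seeing which row came from which data frame), one option is to tag each frame before stacking. A minimal base R sketch, assuming the sample data from the question; the column name source and the object name stacked are made up for illustration:

```r
df1 <- data.frame(
  customer_no = c(10, 10, 10, 20, 30),
  country = c("UK", "UK", "UK", "US", "AU"),
  year = c(2001, 2002, 2003, 2007, 2006),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  customer_no = c(10, 10, 10, 30),
  income = c(700, 800, 900, 1000)
)

# Tag each frame so the origin of every row survives the append.
df1$source <- "df1"
df2$source <- "df2"

# Give both frames the same columns, filled with NA where they have no data.
df1$income <- NA_real_
df2$country <- NA_character_
df2$year <- NA_real_

# Keep only the df1 customers that also occur in df2, then stack.
# rbind() on data frames matches columns by name, so column order does not matter.
keep <- df1$customer_no %in% df2$customer_no
stacked <- rbind(df1[keep, ], df2)
stacked <- stacked[order(stacked$customer_no, stacked$source), ]
rownames(stacked) <- NULL

print(stacked)
```

This is an append rather than a merge: no row of df1 is ever paired with a row of df2, which is why the unmatched columns stay NA.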

Resources