I am looking at foreign powers intervening in civil wars, using RStudio. The unit of analysis in my first dataset is the conflict-year, while in the second it is the conflict-month. I need both of them at the conflict-year level so I can merge them.
Is there any command that allows you to do the opposite of expanding rows?
It's hard to give you specifics without a sample of your data so we know what the structure is. I'm assuming your month-level dataset stores the month as a character string that includes a year. You should be able to extract the year with separate from the tidyr package:
library(tidyverse)
month <- c("June 2015", "July 2015", "September 2016", "August 2016", "March 2014")
conflict <- c("A", "B", "C", "D", "E")
my.data <- data.frame(month, conflict)
my.data
month conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
my.data <- my.data %>%
separate(month, c("month", "year"), sep = " ")
my.data
month year conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
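Once a year column exists, the "opposite of expanding rows" is an aggregation: collapse all months of a conflict-year into a single row. The following is only a minimal base R sketch under stated assumptions. The data frame monthly, its intervention column, and the choice of max() as the summary are hypothetical, since the real structure of the data is not shown.

```r
# Hypothetical month-level data: one row per conflict-month.
monthly <- data.frame(
  conflict     = c("A", "A", "A", "B", "B"),
  month        = c("June 2015", "July 2015", "March 2016",
                   "August 2016", "September 2016"),
  intervention = c(0, 1, 0, 0, 0),  # hypothetical 0/1 indicator
  stringsAsFactors = FALSE
)

# Extract the four-digit year from the end of the month string.
monthly$year <- as.integer(sub(".*([0-9]{4})$", "\\1", monthly$month))

# Collapse to one row per conflict-year; max() flags a year in which
# at least one month had an intervention.
yearly <- aggregate(intervention ~ conflict + year, data = monthly, FUN = max)
yearly <- yearly[order(yearly$conflict, yearly$year), ]
yearly
```

The result has one row per conflict and year, so it could then be combined with the year-level dataset using merge() on the conflict and year columns, assuming both datasets name those columns the same way.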
I have a data frame ordered by month and year. I want to select only a whole number of years, i.e. if the data start in July 2002 and end in September 2010, then select only the data from July 2002 to June 2010.
And if the data start in September 1992 and end in March 2000, then select only the data from September 1992 to August 1999, regardless of any missing months in between.
The data can be downloaded from the following link:
enter link description here
My code:
mydata <- read.csv("E:/mydata.csv", stringsAsFactors=TRUE)
This is the manual selection:
selected.data <- mydata[1:73,] # July 2002 to June 2010
How can I achieve this programmatically?
Here is a base R solution that reproduces your manual subsetting:
mydata <- read.csv("E:/mydata.csv", stringsAsFactors = FALSE)
lookup <-
c(
January = 1,
February = 2,
March = 3,
April = 4,
May = 5,
June = 6,
July = 7,
August = 8,
September = 9,
October = 10,
November = 11,
December = 12
)
mydata$Month <- unlist(lapply(mydata$Month, function(x) lookup[match(x, names(lookup))]))
first.month <- mydata$Month[1]
last.year <- max(mydata$Year)
mydata[1:which(mydata$Month==(first.month -1)&mydata$Year==last.year),]
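Basically, I convert the month names to numbers, then find the row for the month preceding the first month of the data frame within the last year of the data frame, and keep everything up to that row. Note that this relies on that exact month being present in the last year, and it does not handle data that start in January (first.month - 1 would be 0).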
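A variant that does not depend on one particular month being present can be sketched by turning each row into a running month index and keeping only the rows that fall inside the largest whole number of years. This is a sketch under assumptions: the example data below are hypothetical (generated, not the linked file), and the columns are assumed to be called Month (full English month names) and Year, as in the question.

```r
# Hypothetical data: July 2002 to September 2010 with a few months missing.
all_months <- seq(as.Date("2002-07-01"), as.Date("2010-09-01"), by = "month")
kept <- all_months[-c(2, 3, 5)]  # drop Aug, Sep and Nov 2002 (hypothetical gaps)
mydata <- data.frame(
  Month = month.name[as.integer(format(kept, "%m"))],
  Year  = as.integer(format(kept, "%Y")),
  stringsAsFactors = FALSE
)

# Running month index: consecutive calendar months differ by exactly 1.
idx <- mydata$Year * 12 + match(mydata$Month, month.name)

# Number of complete years covered, counted from the first month.
span_months <- max(idx) - idx[1] + 1
n_years <- span_months %/% 12

# Keep rows that fall within those complete years.
selected.data <- mydata[idx - idx[1] < 12 * n_years, ]
tail(selected.data, 1)
```

Because the cut-off is computed from the first and last month only, gaps in between do not matter, and a series starting in January needs no special case.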
Here's a base R one-liner:
result <- mydata[seq_len(with(mydata, which(Month == month.name[match(Month[1],
month.name) - 1] & Year == max(Year)))), ]
head(result)
# Month Year var
#1 July 2002 -91.22997
#2 October 2002 -91.19007
#3 December 2002 -91.05395
#4 February 2003 -91.16958
#5 March 2003 -91.17881
#6 April 2003 -91.15110
tail(result)
# Month Year var
#68 December 2009 -90.92610
#69 January 2010 -91.07379
#70 February 2010 -91.12460
#71 March 2010 -91.10288
#72 April 2010 -91.06040
#73 June 2010 -90.94212
I have a data table with three date columns x, y and z, and I am trying to create a new column (new_col) that holds the middle of the three dates in each row once they are ranked from earliest to latest, i.e., the date between the min and max date. Please see the table below:
x              y              z              new_col
1st Jan 2005   4th May 1998   2nd Mar 2009   1st Jan 2005
9th May 2010   14th Feb 2003  9th Jan 2008   9th Jan 2008
7th Sept 2002  8th Dec 2010   23rd May 2012  8th Dec 2010
So, for rows 1, 2, and 3 I would like the dates from columns x, z, and y, respectively. How can I go about this in R? I have used pmin and pmax, but I can't isolate the date in the middle.
Thanks in advance!
The approach below coerces the character date strings to numeric type Date (as there is no arithmetic with character dates), finds the position of the "middle" date in each row, and returns the corresponding character string, which eventually becomes new_col.
This can be implemented using apply() on each row using an appropriate function:
df$new_col <- apply(df, 1L, function(x) x[order(lubridate::dmy(x))][2L])
df
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sept 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
Note
This returns the expected result. new_col is a character date string.
However, if the OP intends to continue working with type Date, e.g. to do more arithmetic, I recommend following Ben's example: coerce the whole data.frame to type Date and stick to it.
First make sure all your dates are "Date" type, you can use dmy from lubridate for this (assumes your data frame is called df):
library(lubridate)
df[] <- lapply(df, dmy)
Next, sort each row in chronological order, and take the middle column (column 2) to be the new_col:
df$new_col <- as.Date(t(apply(df, 1, sort))[,2])
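Since the question mentions pmin and pmax, here is a sketch of how the middle of three dates could be obtained from those two functions alone, without sorting each row. It assumes the three columns are already of type Date. The data frame df_dates is a hypothetical copy of the example data written in ISO format so that the snippet runs on its own.

```r
# Hypothetical copy of the example data, already stored as Date.
df_dates <- data.frame(
  x = as.Date(c("2005-01-01", "2010-05-09", "2002-09-07")),
  y = as.Date(c("1998-05-04", "2003-02-14", "2010-12-08")),
  z = as.Date(c("2009-03-02", "2008-01-09", "2012-05-23"))
)

# Median of three values, element-wise:
# the larger of min(x, y) and min(max(x, y), z).
df_dates$new_col <- with(df_dates, pmax(pmin(x, y), pmin(pmax(x, y), z)))
df_dates
```

This is vectorised over rows, which may matter for large tables, but it only works for exactly three columns.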
Finally, if you want the result to be displayed in same text format (e.g., "1st Jan 2005" instead of "2005-01-01") then you can use a custom function based on this answer:
library(dplyr)
date_to_text <- function(dates){
  dayy <- day(dates)
  suff <- case_when(dayy %in% c(11,12,13) ~ "th",
                    dayy %% 10 == 1 ~ 'st',
                    dayy %% 10 == 2 ~ 'nd',
                    dayy %% 10 == 3 ~ 'rd',
                    TRUE ~ "th")
  paste0(dayy, suff, " ", format(dates, "%b %Y"))
}
df[] <- lapply(df, date_to_text)
Output
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sep 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
Data
df <- structure(list(x = c("1st Jan 2005", "9th May 2010", "7th Sept 2002"
), y = c("4th May 1998", "14th Feb 2003", "8th Dec 2010"), z = c("2nd Mar 2009",
"9th Jan 2008", "23rd May 2012")), class = "data.frame", row.names = c(NA,
-3L))
After executing the R code, the values I got in the column of dataframe are:
25 July 2012 bet
22 June 2015 bet
09 April 2015 be
14 November 2016
I want only the dates. How can I remove "bet" and "be" from the values?
I am using the below code to extract the above values from the text document:
coalesce((substr((stringr::str_match(text, "ISDA Master Agreement dated as of (.*) ")[, 2]),1,16)),(substr((stringr::str_match(text, "ISDA Master Agreement dated as of (.*) ")[, 2]),1,13)))
If I swap the coalesce arguments, then the 4th value gets truncated.
I am OK with the code, but when cleaning the result, how should I remove the "bet" and "be"?
I am far away from being a regex expert, but here goes a tidyverse way of doing what you want:
library(tidyverse, verbose = FALSE)
df <- tibble::tribble(
~V1, ~V2,
1L, "25 July 2012 bet",
2L, "22 June 2015 bet",
3L, "09 April 2015 be",
4L, "14 November 2016"
)
df %>%
mutate(V2 = str_replace(V2, pattern = "[:space:]be.*", replacement = ""))
#> # A tibble: 4 x 2
#> V1 V2
#> <int> <chr>
#> 1 1 25 July 2012
#> 2 2 22 June 2015
#> 3 3 09 April 2015
#> 4 4 14 November 2016
Created on 2020-02-21 by the reprex package (v0.3.0)
We can use sub to remove the whitespace followed by "be" and everything after it:
sub("\\s+be.*", "", c("25 July 2012 bet", "09 April 2015 be"))
#[1] "25 July 2012" "09 April 2015"
If you use lubridate, dmy() parses the date at the start of each string and ignores the trailing text, returning Date values:
library(lubridate)
test_strings <- c("25 July 2012 bet", "09 April 2015 be")
dmy(test_strings)
[1] "2012-07-25" "2015-04-09"
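Instead of removing the unwanted suffix, the date itself can be matched and extracted, which also copes with trailing text that does not start with "be". This is a base R sketch. The pattern assumes the dates always look like "day month-name four-digit-year", as in the four example values.

```r
x <- c("25 July 2012 bet", "22 June 2015 bet",
       "09 April 2015 be", "14 November 2016")

# One or two digits, a space, a month name, a space, a four-digit year.
date_pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"

# regexpr() locates the first match in each string; regmatches() extracts it.
dates <- regmatches(x, regexpr(date_pattern, x))
dates
```

Note that regmatches() silently drops elements without a match, so the result can be shorter than the input if some strings contain no date.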
My time data are in this format:
datatimedf = data.frame(day_time = c('Apr 2005', '1992', "2004", "Jan 2001", "2015"))
I would like to add "Jan" to the rows that contain only a year.
How can I do this?
An example of expected output is this:
datatimedf = data.frame(day_time = c('Apr 2005', 'Jan 1992', "Jan 2004", "Jan 2001", "Jan 2015"))
What I have so far works for only one row:
x[2,1] <- sub("^", "Jan ", x[2,1])
But how can I apply it to the whole data frame?
Here is a quick way to do it using dplyr:
library(dplyr)
datatimedf$day_time <- as.character(datatimedf$day_time)
datatimedf <- datatimedf %>%
mutate(day_time = ifelse(nchar(day_time) == 4, paste("Jan", day_time), day_time))
#> day_time
#> 1 Apr 2005
#> 2 Jan 1992
#> 3 Jan 2004
#> 4 Jan 2001
#> 5 Jan 2015
For each row, it checks whether the string is 4 characters long and, if so, adds "Jan" to the beginning; otherwise it keeps the original value. This relies on a bare year being the only 4-character input, so it does not generalise well, but it should get you started if you want to make it more generic and able to handle more types of input.
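The sub() call from the question can also be applied to the whole column at once, because sub() is vectorised. The sketch below anchors the pattern so that only values consisting of exactly four digits are changed, which is an assumption about what a "year only" value looks like.

```r
datatimedf <- data.frame(
  day_time = c("Apr 2005", "1992", "2004", "Jan 2001", "2015"),
  stringsAsFactors = FALSE
)

# "^([0-9]{4})$" matches only values that are exactly four digits;
# "\\1" re-inserts the captured year after the "Jan " prefix.
datatimedf$day_time <- sub("^([0-9]{4})$", "Jan \\1", datatimedf$day_time)
datatimedf
```

Values that already contain a month do not match the pattern and are returned unchanged.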
I have a data frame and a vector identifying the rows I'm interested in:
col1 <- c("June","June","June 11","June 11, 2012","June 14, 2012")
col2 <- c("September", "September", "October 8", "October", "Sept 27, 2012")
monthDF <- data.frame(cbind(col1, col2), stringsAsFactors = FALSE)
v0 <- c(1, 2)
I also have two character vectors I would like to apply to the subset and specific columns.
startMonthV <- paste(c(" 1, ", substr(Sys.Date(), 1, 4)), collapse = "")
endMonthV <- paste(c(" 28, ", substr(Sys.Date(), 1, 4)), collapse = "")
I've tried using variations of the apply() function with paste() being the function I want to use, but to no avail. I would like my final result to be a data frame with all of the rows, but looking like this - the first two rows have been modified with the above startMonthV and endMonthV:
col1 col2
1 June 1, 2016 September 28, 2016
2 June 1, 2016 September 28, 2016
3 June 11 October 8
4 June 11, 2012 October
5 June 14, 2012 Sept 27, 2012
I'm new to R, and was wondering whether the apply() family would do, or a function from the plyr package. Every Stack Overflow answer I've found either applies to the whole data frame or collapses the data with the aggregate() function.
Thank you.
We can use mapply and assign the result back at the corresponding rows:
days <- c("1,", "28,")
monthDF[v0, ] <- mapply(paste, monthDF[v0, ], days, substr(Sys.Date(), 1, 4))
monthDF
col1 col2
1 June 1, 2016 September 28, 2016
2 June 1, 2016 September 28, 2016
3 June 11 October 8
4 June 11, 2012 October
5 June 14, 2012 Sept 27, 2012
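Here we created a vector days holding the day to append to each column. mapply() walks over the columns of monthDF[v0, ] and the elements of days in parallel, so col1 is pasted with "1," and col2 with "28,", each followed by the current year.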
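For readers who find mapply() opaque, the same update can be written as an explicit loop over the columns. This is only an illustrative sketch. The named vector suffix is a hypothetical helper that maps each column name to the text appended to it, and the year is taken from the system date, so the output changes from year to year.

```r
col1 <- c("June", "June", "June 11", "June 11, 2012", "June 14, 2012")
col2 <- c("September", "September", "October 8", "October", "Sept 27, 2012")
monthDF <- data.frame(col1, col2, stringsAsFactors = FALSE)
v0 <- c(1, 2)

yr <- format(Sys.Date(), "%Y")              # current year as a string
suffix <- c(col1 = " 1, ", col2 = " 28, ")  # hypothetical per-column suffix

# Append "<suffix><year>" to the selected rows of each column.
for (nm in names(suffix)) {
  monthDF[v0, nm] <- paste0(monthDF[v0, nm], suffix[[nm]], yr)
}
monthDF
```

Rows not listed in v0 are left untouched, and adding another column only requires another entry in suffix.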