I want to find duplicates within each row (horizontally) and keep only the unique values. Please help me with this.
Here is a sample dataset; I hope it helps.
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
TableTest <- data.frame(X,Y,Z,S)
TableTest
Input dataset
X Y Z S
1 1 India India India
2 2 India India France
3 3 Philippines Netherlands Netherlands
4 4 Netherlands France France
5 5 France France India
Expected Output
X Y Z S
1 1 India NA NA
2 2 India France NA
3 3 Philippines Netherlands NA
4 4 Netherlands France NA
5 5 France India NA
Please help.
TableTest[,-1] <- as.data.frame(t(apply(TableTest[,-1], 1, function(a) {
  a <- replace(a, duplicated(a), NA_character_)  # blank out repeats within the row
  a[order(is.na(a))]                             # push the NAs to the end, keeping the order of the rest
})))
TableTest
# X Y Z S
# 1 1 India <NA> <NA>
# 2 2 India France <NA>
# 3 3 Philippines Netherlands <NA>
# 4 4 Netherlands France <NA>
# 5 5 France India <NA>
Another base R option
TableTest[-1] <- do.call(rbind,lapply(apply(TableTest[-1],1,unique),`length<-`,ncol(TableTest)-1))
or a simpler version (thanks to @Onyambu for the advice in the comments)
TableTest[-1] <- t(apply(TableTest[-1], 1, function(x)`length<-`(unique(x),ncol(TableTest[-1]))))
which gives
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
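One caveat with the first variant: apply() only returns a list when the rows have different numbers of unique values. If every row happens to have the same number of uniques (greater than one), apply() simplifies the result to a matrix and the lapply() step then walks over single cells. A minimal sketch of the problem and of a row loop that avoids it; the data frame `d` is made up for illustration:

```r
# Made-up example: every row has exactly two unique values
d <- data.frame(p = c("a", "c"), q = c("b", "d"), r = c("a", "c"))

u <- apply(d, 1, unique)
is.matrix(u)   # apply() simplified the result, so this is not a list

# Looping over row indices always yields one vector per row
m <- as.matrix(d)
fixed <- do.call(rbind, lapply(seq_len(nrow(m)), function(i) {
  `length<-`(unique(m[i, ]), ncol(m))   # pad each row's uniques with NA
}))
fixed
```

The second variant (the one using t(apply(...)) with `length<-` inside the function) does not have this issue, because every row is padded to the same length before apply() simplifies.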
My solution:
TableTest[2:4] <- as.data.frame(t(apply(TableTest[2:4], 1, function(x) {
xo <- ifelse(!duplicated(x), x, NA_character_)
if (any(is.na(xo))) xo <- xo[!is.na(xo)]
length(xo) <- ncol(TableTest) - 1
xo
})))
Output
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
I don't think you can do it by only using data.frames, because you're moving values across columns. But here's one way to do it using matrices:
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
output <- apply(cbind(Y,Z,S), 1, function(row) {
rm_dup <- unique(row)
return(c(rm_dup, rep(NA_character_,
3 - length(rm_dup))))
})
t(output)
[,1] [,2] [,3]
[1,] "India" NA NA
[2,] "India" "France" NA
[3,] "Philippines" "Netherlands" NA
[4,] "Netherlands" "France" NA
[5,] "France" "India" NA
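The answers above hard-code the column range (2:4) or the row width (3). A minimal sketch of the same idea wrapped in a function that takes the column names instead; `row_unique` is a hypothetical helper name, not something from the answers, and it assumes the data frame has at least one row:

```r
# Hypothetical helper: keep each row's unique values, left-aligned,
# and pad the rest of the row with NA.
row_unique <- function(df, cols) {
  m <- as.matrix(df[cols])
  out <- do.call(rbind, lapply(seq_len(nrow(m)), function(i) {
    u <- unique(m[i, ][!is.na(m[i, ])])  # uniques in order of first appearance
    length(u) <- ncol(m)                 # extending a vector pads it with NA
    u
  }))
  df[cols] <- out
  df
}

TableTest <- data.frame(
  X = 1:5,
  Y = c("India", "India", "Philippines", "Netherlands", "France"),
  Z = c("India", "India", "Netherlands", "France", "France"),
  S = c("India", "France", "Netherlands", "France", "India")
)
res <- row_unique(TableTest, c("Y", "Z", "S"))
res
```

Because the values pass through as.matrix(), columns of mixed types would all be converted to character; here all three columns are character already.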
My data is like this:
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany South Korea NA
China Russia NA NA NA NA
What I want to do is create a new variable that flags rows where the country column matches one of the supporter columns (supporter1 to supporter5); for instance, country France and supporter2 France are the same. In that case the new variable should be 1, and 0 otherwise.
I expect to have this:
country supporter1 supporter2 supporter3 supporter4 supporter5 new variable
USA Albania Germany USA NA NA 1
France USA France NA NA NA 1
UK UK Chile Peru NA NA 1
Germany USA Iran Mexico India Pakistan 0
USA China Spain NA NA NA 0
Cuba Cuba UK Germany South Korea NA 1
China Russia NA NA NA NA 0
Update: a dplyr-only solution using if_any():
library(dplyr)
df %>%
rowwise() %>%
mutate(new_var = as.integer(as.logical(if_any(starts_with("supporter"), ~ . %in% country))))
country supporter1 supporter2 supporter3 supporter4 supporter5 new_var
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
First answer (also correct):
Here is one possible solution:
calculate rowwise
check in cols supporter1 to supporter5 if country is included
unite all new columns to one and with an ifelse statement take 1 or 0
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rowwise() %>%
mutate(across(supporter1:supporter5, ~ifelse(. %in% country, 1,0), .names = "new_{col}")) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
mutate(New_Col = ifelse(str_detect(New_Col, "1"), 1,0))
country supporter1 supporter2 supporter3 supporter4 supporter5 New_Col
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
Here is a base R solution.
First, mapply checks each supporter* column for equality with country; comparisons involving NA are treated as FALSE. Then as.integer/rowSums turns rows with at least one TRUE into 1, and all other rows into 0.
eq <- mapply(\(x, y){x == y & !is.na(x)}, df1[-1], df1[1])
as.integer(rowSums(eq) != 0)
#[1] 1 1 1 0 0 1 0
df1$new_variable <- as.integer(rowSums(eq) != 0)
Data
df1 <- read.table(text = "
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany 'South Korea' NA
China Russia NA NA NA NA
", header = TRUE)
Another solution is to check, row by row, whether country appears in one of the supporter columns:
df <- data.frame(country=c("USA","France","UK","Germany","USA","Cuba","China"),
supporter1=c("Albania","USA","UK","USA","China","Cuba","Russia"),
supporter2=c("Germany","France","Chile","Iran","Spain","UK","NA"),
supporter3=c("USA","NA","Peru","Mexico","NA","Germany","NA"),
supporter4=c("NA","NA","NA","India","NA","South Korea","NA"),
supporter5=c("NA","NA","NA","Pakistan","NA","NA","NA"))
With that data, the check is:
df$new <- sapply(seq(1,nrow(df)), function(x) ifelse(df$country[x] %in% df[x,2:6],1,0))
> df$new
[1] 1 1 1 0 0 1 0
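The element-wise comparison that mapply() performs in the base R answer above can also be written as a single data-frame comparison, because `==` between a data frame and a vector of length nrow() compares that vector against every column. A minimal sketch, assuming the missing supporters are real NA values (as in df1) rather than the string "NA":

```r
df1 <- data.frame(
  country    = c("USA", "France", "UK", "Germany", "USA", "Cuba", "China"),
  supporter1 = c("Albania", "USA", "UK", "USA", "China", "Cuba", "Russia"),
  supporter2 = c("Germany", "France", "Chile", "Iran", "Spain", "UK", NA),
  supporter3 = c("USA", NA, "Peru", "Mexico", NA, "Germany", NA),
  supporter4 = c(NA, NA, NA, "India", NA, "South Korea", NA),
  supporter5 = c(NA, NA, NA, "Pakistan", NA, NA, NA)
)

# Logical matrix: does each supporter cell equal the row's country?
eq <- df1[-1] == df1$country

# NA comparisons are dropped by na.rm = TRUE, so they count as "no match"
df1$new_variable <- as.integer(rowSums(eq, na.rm = TRUE) > 0)
df1$new_variable
#> [1] 1 1 1 0 0 1 0
```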
I have a dataframe with information of some countries and states like this:
data.frame("state1"= c(NA,NA,"Beijing","Beijing","Schleswig-Holstein","Moskva",NA,"Moskva",NA,"Berlin"),
"country1"=c("Spain","Spain","China","China","Germany","Russia","Germany","Russia","Germany","Germany"),
"state2"= c(NA,NA,"Beijing",NA,NA,NA,"Moskva",NA,NA,NA),
"country2"=c("Germany","Germany","China","Germany","","Ukraine","Russia","Germany","Ukraine","" ),
"state3"= c(NA,NA,NA,NA,"Schleswig-Holstein",NA,NA,NA,NA,"Berlin"),
"country3"=c("Spain","Spain","Germany","Germany","Germany","Germany","Germany","Germany","Germany","Germany"))
Now I would like to create a new column containing the German state (the result should look like the data frame below).
When at least one of the three state variables is a German state, it should be assigned to the new variable.
data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))
Please help a beginner for the condition setting.
Thank you in advance!
Using dplyr::mutate() with case_when() works, although I suspect there should be a more efficient way using across()
library(dplyr)
df %>%
mutate(GE_state = case_when(country1 == "Germany" & !is.na(state1) ~ state1,
country2 == "Germany" & !is.na(state2) ~ state2,
country3 == "Germany" & !is.na(state3) ~ state3,
TRUE ~ NA_character_))
#> state1 country1 state2 country2 state3 country3
#> 1 <NA> Spain <NA> Germany <NA> Spain
#> 2 <NA> Spain <NA> Germany <NA> Spain
#> 3 Beijing China Beijing China <NA> Germany
#> 4 Beijing China <NA> Germany <NA> Germany
#> 5 Schleswig-Holstein Germany <NA> Schleswig-Holstein Germany
#> 6 Moskva Russia <NA> Ukraine <NA> Germany
#> 7 <NA> Germany Moskva Russia <NA> Germany
#> 8 Moskva Russia <NA> Germany <NA> Germany
#> 9 <NA> Germany <NA> Ukraine <NA> Germany
#> 10 Berlin Germany <NA> Berlin Germany
#> GE_state
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 Schleswig-Holstein
#> 6 <NA>
#> 7 <NA>
#> 8 <NA>
#> 9 <NA>
#> 10 Berlin
Created on 2021-03-31 by the reprex package (v1.0.0)
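If the number of state/country pairs grows, the three case_when() branches can be replaced by a loop over the pair index. This is only a base R sketch of the same rule (take the first non-missing state whose paired country is Germany), assuming the columns follow the stateN/countryN naming used in the question; `n_pairs` is a name introduced here for illustration:

```r
df <- data.frame(
  state1   = c(NA, NA, "Beijing", "Beijing", "Schleswig-Holstein", "Moskva", NA, "Moskva", NA, "Berlin"),
  country1 = c("Spain", "Spain", "China", "China", "Germany", "Russia", "Germany", "Russia", "Germany", "Germany"),
  state2   = c(NA, NA, "Beijing", NA, NA, NA, "Moskva", NA, NA, NA),
  country2 = c("Germany", "Germany", "China", "Germany", "", "Ukraine", "Russia", "Germany", "Ukraine", ""),
  state3   = c(NA, NA, NA, NA, "Schleswig-Holstein", NA, NA, NA, NA, "Berlin"),
  country3 = c("Spain", "Spain", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany")
)

n_pairs <- 3  # assumed number of stateN/countryN pairs
ge_state <- rep(NA_character_, nrow(df))
for (i in seq_len(n_pairs)) {
  state   <- df[[paste0("state", i)]]
  country <- df[[paste0("country", i)]]
  # fill only rows that are still empty; %in% is FALSE (not NA) for a missing country
  hit <- is.na(ge_state) & !is.na(state) & country %in% "Germany"
  ge_state[hit] <- state[hit]
}
df$GE_State <- ge_state
df$GE_State
```

Earlier pairs win, which matches the top-to-bottom order of the case_when() branches.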
I think you want cbind() here:
df1 <- cbind(df1, df2)
Data:
df1 <- <your first data frame>
df2 <- data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))
I have a dataset (df1) on hundreds of national crises, where each observation is a crisis event at the country level with a start and an end date. I also have the date when the crisis was announced (yyyy-mm-dd format), and a bunch of other crisis characteristics.
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
eventID country start end announcement x1 x2
1 ALB 1994 1996 1994-11-01 6 a
2 ALB 1998 1999 1998-03-01 2 q
3 ARG 1998 1999 1998-07-01 8 k
4 ARG 1991 1993 1992-01-01 7 b
I need to make df2, a panel of countries with annual observations from the earliest "start" year to the latest "end" year. I want to have a dummy variable, "crisis", that equals 1 for the years between "start" and "end" in df1, and 0 otherwise. I want "announcement" to contain the announcement date in df1 for the year with an announcement, and "NA" otherwise. I would like the extra crisis characteristics, x1 and x2, to show up for crisis years to which they correspond, and "NA" otherwise.
I also need observations for each country for years in which no country has a crisis (in df2: 1997).
df2 <- data.frame(cbind(year=c(1991,1992,1993,1994,1995,1996,1997,1998,1999,1991,1992,1993,1994,1995,1996,1997,1998,1999), country=c("ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG"),crisis=c(0,0,0,1,1,1,0,1,1,1,1,1,0,0,0,0,1,1), announcement=c(NA, NA,NA,"1994-11-01",NA,NA,NA,"1998-03-01",NA,NA,"1992-01-01",NA,NA,NA,NA,NA,"1998-07-01",NA), x1=c(NA,NA,NA,6,6,6,NA,2,2,7,7,7,NA,NA,NA,NA,8,8), x2=c(NA,NA,NA,"a","a","a",NA,"q","q","b","b","b",NA,NA,NA,NA,"k","k")))
year country crisis announcement x1 x2
1991 ALB 0 NA NA NA
1992 ALB 0 NA NA NA
1993 ALB 0 NA NA NA
1994 ALB 1 1994-11-01 6 a
1995 ALB 1 NA 6 a
1996 ALB 1 NA 6 a
1997 ALB 0 NA NA NA
1998 ALB 1 1998-03-01 2 q
1999 ALB 1 NA 2 q
1991 ARG 1 NA 7 b
1992 ARG 1 1992-01-01 7 b
1993 ARG 1 NA 7 b
1994 ARG 0 NA NA NA
1995 ARG 0 NA NA NA
1996 ARG 0 NA NA NA
1997 ARG 0 NA NA NA
1998 ARG 1 1998-07-01 8 k
1999 ARG 1 NA 8 k
I would love any suggestions! I'm stumped as to how to replicate the observations for each year, but only include x1 and x2 values when my new "crisis" dummy = 1
Thanks!
Making use of dplyr and tidyr this could be achieved like so:
library(dplyr)
library(tidyr)
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
df1 %>%
mutate(year = factor(start, levels = min(start):max(end))) %>%
complete(year, country) %>%
mutate(year = as.numeric(as.character(year))) %>%
arrange(country, year) %>%
group_by(country) %>%
fill(eventID, end, x1, x2) %>%
ungroup() %>%
mutate(across(c(eventID, end, x1, x2), ~ ifelse(end < year, NA, .)),
crisis = as.numeric(!is.na(eventID)))
#> # A tibble: 18 x 9
#> year country eventID start end announcement x1 x2 crisis
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1991 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 2 1992 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 3 1993 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 4 1994 ALB 1 1994 1996 1994-11-01 6 a 1
#> 5 1995 ALB 1 <NA> 1996 <NA> 6 a 1
#> 6 1996 ALB 1 <NA> 1996 <NA> 6 a 1
#> 7 1997 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 8 1998 ALB 2 1998 1999 1998-03-01 2 q 1
#> 9 1999 ALB 2 <NA> 1999 <NA> 2 q 1
#> 10 1991 ARG 4 1991 1993 1992-01-01 7 b 1
#> 11 1992 ARG 4 <NA> 1993 <NA> 7 b 1
#> 12 1993 ARG 4 <NA> 1993 <NA> 7 b 1
#> 13 1994 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 14 1995 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 15 1996 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 16 1997 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 17 1998 ARG 3 1998 1999 1998-07-01 8 k 1
#> 18 1999 ARG 3 <NA> 1999 <NA> 8 k 1
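Note that in the output above the announcement date sits on the row of the start year (ARG 1991 carries 1992-01-01), whereas the question asks for it on the year of the announcement. A base R sketch that places it by announcement year; it assumes crisis spells within a country never overlap, and the names `event_years`, `grid` and `panel` are introduced here for illustration:

```r
df1 <- data.frame(
  eventID = 1:4,
  country = c("ALB", "ALB", "ARG", "ARG"),
  start = c(1994, 1998, 1998, 1991),
  end = c(1996, 1999, 1999, 1993),
  announcement = c("1994-11-01", "1998-03-01", "1998-07-01", "1992-01-01"),
  x1 = c(6, 2, 8, 7),
  x2 = c("a", "q", "k", "b"),
  stringsAsFactors = FALSE
)

# Expand every event into one row per crisis year
event_years <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  ev <- df1[i, ]
  yrs <- ev$start:ev$end
  ann_year <- as.integer(substr(ev$announcement, 1, 4))
  data.frame(
    country = ev$country,
    year = yrs,
    crisis = 1,
    # keep the date only on the row of the announcement year
    announcement = ifelse(yrs == ann_year, ev$announcement, NA_character_),
    x1 = ev$x1,
    x2 = ev$x2,
    stringsAsFactors = FALSE
  )
}))

# Full country-by-year grid, including years without any crisis (1997)
grid <- expand.grid(
  year = min(df1$start):max(df1$end),
  country = unique(df1$country),
  stringsAsFactors = FALSE
)

panel <- merge(grid, event_years, by = c("country", "year"), all.x = TRUE)
panel$crisis[is.na(panel$crisis)] <- 0   # grid rows with no matching event
panel <- panel[order(panel$country, panel$year), ]
panel
```

Building df1 with data.frame() directly, rather than data.frame(cbind(...)), keeps start, end and x1 numeric instead of turning every column into character.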
A data frame has 3 columns
-----------------------------------------
| Id | Country | Date |
-----------------------------------------
The 3 columns record the travel history of the person.
Three more columns need to be created, holding the rolling top 3 countries that this person (ID) has travelled to most often before the date on the row.
(If two countries are tied, the most recently visited country takes precedence.)
mydata <- data.frame(ID = c('A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2'),
Country = c('Japan', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Japan', 'France', 'UK', 'Spain', 'Spain', 'UK', 'UK', 'Brazil'),
Date = as.Date(c('2010/01/02', '2010/04/18', '2011/03/22', '2011/11/23', '2012/05/09', '2012/09/11', '2014/01/06', '2015/12/11', '2010/04/03', '2010/05/11', '2011/05/01', '2012/03/01', '2013/01/03', '2014/01/04')))
# final data should look like below
#ID Country Date Pref1 Pref2 Pref3
#A1B1 Japan 2010-01-02 NA NA NA
#A1B1 USA 2010-04-18 Japan NA NA
#A1B1 USA 2011-03-22 USA Japan NA
#A1B1 USA 2011-11-23 USA Japan NA
#A1B1 Germany 2012-05-09 USA Japan NA
#A1B1 Germany 2012-09-11 USA Germany Japan
#A1B1 Japan 2014-01-06 USA Germany Japan
#A1B1 France 2015-12-11 USA Japan Germany
#A2B2 UK 2010-04-03 NA NA NA
#A2B2 Spain 2010-05-11 UK NA NA
#A2B2 Spain 2011-05-01 Spain UK NA
#A2B2 UK 2012-03-01 Spain UK NA
#A2B2 UK 2013-01-03 UK Spain NA
#A2B2 Brazil 2014-01-04 UK Spain NA
Q. How can I create the last 3 columns, i.e. the rolling top 3 countries by visit count for each ID?
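Here is a way that takes the 3 most frequent countries among the earlier rows, at each row for each ID. Note that sort(table()) breaks ties alphabetically rather than by the most recent visit, so tied rows can differ from the expected output.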
library(dplyr)
mydata %>%
group_by(ID) %>%
mutate(data = purrr::map(row_number(), ~{
un_country <- Country[seq_len(.x - 1)]
if(.x == 1) un_country <- NA
else un_country <- names(sort(table(un_country), decreasing = TRUE))[1:3]
data.frame(t(un_country[1:3]))
})) %>%
tidyr::unnest_wider(data)
# ID Country Date X1 X2 X3
# <chr> <chr> <date> <chr> <chr> <chr>
# 1 A1B1 Japan 2010-01-02 NA NA NA
# 2 A1B1 USA 2010-04-18 Japan NA NA
# 3 A1B1 USA 2011-03-22 Japan USA NA
# 4 A1B1 USA 2011-11-23 USA Japan NA
# 5 A1B1 Germany 2012-05-09 USA Japan NA
# 6 A1B1 Germany 2012-09-11 USA Germany Japan
# 7 A1B1 Japan 2014-01-06 USA Germany Japan
# 8 A1B1 France 2015-12-11 USA Germany Japan
# 9 A2B2 UK 2010-04-03 NA NA NA
#10 A2B2 Spain 2010-05-11 UK NA NA
#11 A2B2 Spain 2011-05-01 Spain UK NA
#12 A2B2 UK 2012-03-01 Spain UK NA
#13 A2B2 UK 2013-01-03 Spain UK NA
#14 A2B2 Brazil 2014-01-04 UK Spain NA
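The output above differs from the expected result in rows 3, 8 and 13, where two countries are tied on count. A base R sketch that adds the "most recent visit wins" tie-break by ordering on count first and last visit date second; `top3_before` is a hypothetical helper name, and the sketch assumes that only strictly earlier dates count as "before":

```r
mydata <- data.frame(
  ID = rep(c("A1B1", "A2B2"), times = c(8, 6)),
  Country = c("Japan", "USA", "USA", "USA", "Germany", "Germany", "Japan", "France",
              "UK", "Spain", "Spain", "UK", "UK", "Brazil"),
  Date = as.Date(c("2010-01-02", "2010-04-18", "2011-03-22", "2011-11-23",
                   "2012-05-09", "2012-09-11", "2014-01-06", "2015-12-11",
                   "2010-04-03", "2010-05-11", "2011-05-01", "2012-03-01",
                   "2013-01-03", "2014-01-04")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each trip, rank the countries visited strictly
# before it by visit count, breaking ties by the most recent visit.
top3_before <- function(country, date) {
  n <- length(country)
  out <- matrix(NA_character_, nrow = n, ncol = 3,
                dimnames = list(NULL, paste0("Pref", 1:3)))
  for (i in seq_len(n)) {
    prev <- date < date[i]
    if (!any(prev)) next
    counts <- table(country[prev])
    last_visit <- tapply(as.numeric(date[prev]), country[prev], max)
    ord <- order(-as.vector(counts), -as.vector(last_visit[names(counts)]))
    top <- names(counts)[ord]
    k <- min(3, length(top))
    out[i, seq_len(k)] <- top[seq_len(k)]
  }
  out
}

prefs <- do.call(rbind, lapply(split(mydata, mydata$ID), function(d) {
  d <- d[order(d$Date), ]
  data.frame(d, top3_before(d$Country, d$Date), stringsAsFactors = FALSE)
}))
rownames(prefs) <- NULL
prefs
```

The loop rescans the earlier rows for every trip, so it is quadratic per ID; that is fine for short travel histories, but the data.table approach is likely a better fit for large data.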
I think this does it. I've included the mydata here as I think there was a typo in one of the dates.
mydata <- data.frame(ID = c('A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A1B1', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2', 'A2B2'),
Country = c('Japan', 'USA', 'USA', 'USA', 'Germany', 'Germany', 'Japan', 'France', 'UK', 'Spain', 'Spain', 'UK', 'UK', 'Brazil'),
Date = as.Date(c('2010/01/02', '2010/04/18', '2011/03/22', '2011/11/23', '2012/05/09', '2012/09/11', '2014/01/06', '2015/12/11', '2010/04/03', '2010/05/11', '2011/05/01', '2012/03/01', '2013/01/03', '2014/01/04')))
library(data.table)
setDT(mydata)
mydata[order(Date), `:=`(num_v = seq_len(.N), last_v = Date), .(ID, Country)]
x <- mydata[
mydata[, CJ(Country = unique(Country), Date = unique(Date)), ID],
on=c('ID', 'Country', 'Date'), roll=Inf]
x[, `:=`(num_v = shift(num_v), last_v = shift(last_v)), .(ID, Country)]
x[is.na(num_v), Country := NA]
y <- x[,
.SD[order(-num_v, -last_v)][1:3, .(Pref = paste0('Pref',1:3), Country)],
.(ID, Date)]
dcast(y, ID+Date~Pref, value.var = 'Country')
#> ID Date Pref1 Pref2 Pref3
#> 1: A1B1 2010-01-02 <NA> <NA> <NA>
#> 2: A1B1 2010-04-18 Japan <NA> <NA>
#> 3: A1B1 2011-03-22 USA Japan <NA>
#> 4: A1B1 2011-11-23 USA Japan <NA>
#> 5: A1B1 2012-05-09 USA Japan <NA>
#> 6: A1B1 2012-09-11 USA Germany Japan
#> 7: A1B1 2014-01-06 USA Germany Japan
#> 8: A1B1 2015-12-11 USA Japan Germany
#> 9: A2B2 2010-04-03 <NA> <NA> <NA>
#> 10: A2B2 2010-05-11 UK <NA> <NA>
#> 11: A2B2 2011-05-01 Spain UK <NA>
#> 12: A2B2 2012-03-01 Spain UK <NA>
#> 13: A2B2 2013-01-03 UK Spain <NA>
#> 14: A2B2 2014-01-04 UK Spain <NA>
You can join back on the Country from the original mydata if you need it.
This isn't a super clean answer, and it ranks countries over each ID's full history rather than only the visits before each row. Hopefully it gets you close.
library(readr)
df <- readr::read_table(
"ID Country Date
A1B1 Japan 2010-01-02
A1B1 USA 2010-04-18
A1B1 USA 2011-03-22
A1B1 USA 2011-11-23
A1B1 Germany 2012-05-09
A1B1 Germany 2012-09-11
A1B1 Japan 2014-01-06
A1B1 France 2015-12-11
A2B2 UK 2010-04-03
A2B2 Spain 2010-05-11
A2B2 Spain 2011-05-01
A2B2 UK 2012-03-01
A3B2 UK 2013-01-03
A3B2 Brazil 2014-01-04")
df
library(tidyverse)
rankings <- df %>%
group_by(ID, Country) %>%
summarise(obs = n(),
last_dt = max(Date)) %>%
arrange(ID,-obs, desc(last_dt)) %>%
mutate(rank = 1:n()) %>% print() %>%
filter(rank <= 3) %>%
pivot_wider(
names_from = rank,
values_from = Country,
names_prefix = "rank_",
id_cols = ID
) %>% print()
#> `summarise()` regrouping output by 'ID' (override with `.groups` argument)
#> # A tibble: 8 x 5
#> # Groups: ID [3]
#> ID Country obs last_dt rank
#> <chr> <chr> <int> <date> <int>
#> 1 A1B1 USA 3 2011-11-23 1
#> 2 A1B1 Japan 2 2014-01-06 2
#> 3 A1B1 Germany 2 2012-09-11 3
#> 4 A1B1 France 1 2015-12-11 4
#> 5 A2B2 UK 2 2012-03-01 1
#> 6 A2B2 Spain 2 2011-05-01 2
#> 7 A3B2 Brazil 1 2014-01-04 1
#> 8 A3B2 UK 1 2013-01-03 2
#> # A tibble: 3 x 4
#> # Groups: ID [3]
#> ID rank_1 rank_2 rank_3
#> <chr> <chr> <chr> <chr>
#> 1 A1B1 USA Japan Germany
#> 2 A2B2 UK Spain <NA>
#> 3 A3B2 Brazil UK <NA>
df %>% left_join(rankings, by = "ID")
#> # A tibble: 14 x 6
#> ID Country Date rank_1 rank_2 rank_3
#> <chr> <chr> <date> <chr> <chr> <chr>
#> 1 A1B1 Japan 2010-01-02 USA Japan Germany
#> 2 A1B1 USA 2010-04-18 USA Japan Germany
#> 3 A1B1 USA 2011-03-22 USA Japan Germany
#> 4 A1B1 USA 2011-11-23 USA Japan Germany
#> 5 A1B1 Germany 2012-05-09 USA Japan Germany
#> 6 A1B1 Germany 2012-09-11 USA Japan Germany
#> 7 A1B1 Japan 2014-01-06 USA Japan Germany
#> 8 A1B1 France 2015-12-11 USA Japan Germany
#> 9 A2B2 UK 2010-04-03 UK Spain <NA>
#> 10 A2B2 Spain 2010-05-11 UK Spain <NA>
#> 11 A2B2 Spain 2011-05-01 UK Spain <NA>
#> 12 A2B2 UK 2012-03-01 UK Spain <NA>
#> 13 A3B2 UK 2013-01-03 Brazil UK <NA>
#> 14 A3B2 Brazil 2014-01-04 Brazil UK <NA>
Created on 2020-08-29 by the reprex package (v0.3.0)
Here's a messy Base R solution:
rlln_rnk_df <- do.call("rbind", lapply(split(mydata, mydata$ID), function(x){
y <- do.call("rbind", lapply(seq_len(nrow(x)), function(i){
tmp <- x[x$Date <= x$Date[i],]
tmp1 <- cbind(head(tmp[order(tmp$Date, decreasing = TRUE),], 1),
rnk = t(names(sort(table(tmp$Country), decreasing = TRUE))))
tmp1 <- setNames(tmp1, c(names(tmp), paste0("rnk.", 1:(ncol(tmp1) - ncol(tmp)))))
tmp1[,setdiff(paste0("rnk.", 1:(length(unique(mydata$Country)))), names(tmp1))] <- NA_character_
tmp1
}
)
)
z <- y[order(y$Date),]
cbind(ID = z$ID, Country = z$Country, Date = z$Date,
z[match(z$Date, z$Date[2:nrow(z)]), (grep("rnk", names(z), value = TRUE))])
}
)
)
df_clean <- data.frame(rlln_rnk_df[, colSums(is.na(rlln_rnk_df)) < nrow(rlln_rnk_df)],
row.names = NULL)
I would like to know how I can score the rows of my data frame based on matches found with grep().
Say I have a data frame containing this:
age=c("France","Mars","Jupitor","Moon","Sun","Afrika","Texas","Michigan","Washington","Kiev","Amsterdam","Norway")
height=c("Paris","Planet","Planet","COLD","HOT!","LIONS","Austin","Lansing","WashingtonDC","Ukrain","Holland","Oslo")
village=data.frame(age=age,height=height)
and I use grep('Moon',village$age, ignore.case=TRUE) to find which row it is on.
How can I add a column in front of age that scores the matching row, for example with the number 1,
and with the number 2 for the row found by grep('FRANCE',village$age, ignore.case=TRUE)?
You didn't specify what the non-found "scores" should be, so the following just uses NA's:
age <- c("France","Mars","Jupitor","Moon","Sun","Afrika",
"Texas","Michigan","Washington","Kiev","Amsterdam","Norway")
height <- c("Paris","Planet","Planet","COLD","HOT!","LIONS",
"Austin","Lansing","WashingtonDC","Ukrain","Holland","Oslo")
village <- data.frame(score=NA, age=age, height=height)
print(village)
## score age height
## 1 NA France Paris
## 2 NA Mars Planet
## 3 NA Jupitor Planet
## 4 NA Moon COLD
## 5 NA Sun HOT!
## 6 NA Afrika LIONS
## 7 NA Texas Austin
## 8 NA Michigan Lansing
## 9 NA Washington WashingtonDC
## 10 NA Kiev Ukrain
## 11 NA Amsterdam Holland
## 12 NA Norway Oslo
# Index the score column directly, so that a pattern with no match is a no-op instead of an error
village$score[grep('moon', village$age, ignore.case=TRUE)] <- 1
village$score[grep('france', village$age, ignore.case=TRUE)] <- 2
print(village)
## score age height
## 1 2 France Paris
## 2 NA Mars Planet
## 3 NA Jupitor Planet
## 4 1 Moon COLD
## 5 NA Sun HOT!
## 6 NA Afrika LIONS
## 7 NA Texas Austin
## 8 NA Michigan Lansing
## 9 NA Washington WashingtonDC
## 10 NA Kiev Ukrain
## 11 NA Amsterdam Holland
## 12 NA Norway Oslo
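With more than a couple of patterns, the repeated assignments can be driven by a lookup vector. A minimal sketch, assuming each pattern maps to one numeric score and that later patterns may overwrite earlier ones when a row matches both; the `scores` vector is made up for illustration:

```r
age <- c("France", "Mars", "Jupitor", "Moon", "Sun", "Afrika",
         "Texas", "Michigan", "Washington", "Kiev", "Amsterdam", "Norway")
height <- c("Paris", "Planet", "Planet", "COLD", "HOT!", "LIONS",
            "Austin", "Lansing", "WashingtonDC", "Ukrain", "Holland", "Oslo")
village <- data.frame(score = NA_real_, age = age, height = height)

# Illustrative lookup: pattern -> score
scores <- c(moon = 1, france = 2)

for (pattern in names(scores)) {
  hits <- grepl(pattern, village$age, ignore.case = TRUE)
  village$score[hits] <- scores[[pattern]]   # rows without a match stay NA
}
village
```

grepl() returns a logical vector the length of the column, so a pattern that matches nothing simply leaves the scores unchanged.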