Extracting hidden Info from R Dataframe

I have a dataframe that looks like this:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2001 Chicago COO 2
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
Sam 2006 Chicago COO 0
Sam 2007 Chicago CFO 0
Sam 2007 Chicago CEO 1
Sam 2010 New York Advisor 0
I want to know, at the person level, who has at least one of the following two patterns:
In one available year an office has rank 0, and in the next available year the office still exists but no longer has rank 0, i.e. all of its ranks are greater than 0 (the job does not matter). For example, Los Angeles for Harry.
In one available year an office has rank 0, and in the previous available year the office already exists but does not have rank 0, i.e. all of its ranks are greater than 0. For example, Chicago for Peter.
Note that New York for Lily matches neither pattern: relative to 2011, the previous available year for Lily is 2008, not 2007.
Note that an office can appear several times in a year (with different jobs); Chicago for Sam is one such case. Chicago for Sam also does not count: although Chicago has rank 1 in 2007 and rank 0 in the previous available year, it also has rank 0 in 2007.
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
Sam 0

If I understand correctly, I think this will work. You want to:
Figure out whether any job in each person-year-office has rank 0.
For each person-office, check the two cases you're interested in (the current row has a rank 0, and either the previous or the next row does not). This is easier to do if you expand the dataframe to include every office for each person-year.
For each person, check whether any row matches either case you specified, and replace the missing values with FALSE.
library(tidyverse)

# Whitespace-delimited parsing would split multi-word offices such as
# "Los Angeles", so the sample data is entered with tribble() here.
df <- tribble(
  ~person, ~year, ~Office,       ~Job,       ~rank,
  "Harry", 2002,  "Los Angeles", "CEO",      0,
  "Harry", 2006,  "Boston",      "CEO",      0,
  "Harry", 2006,  "Los Angeles", "Advisor",  1,
  "Harry", 2006,  "Chicago",     "Chairman", 2,
  "Peter", 2001,  "New York",    "Director", 0,
  "Peter", 2001,  "Chicago",     "CFO",      1,
  "Peter", 2001,  "Chicago",     "COO",      2,
  "Peter", 2002,  "Chicago",     "CEO",      0,
  "Lily",  2005,  "Springfield", "CEO",      0,
  "Lily",  2007,  "New York",    "CFO",      0,
  "Lily",  2008,  "Boston",      "COO",      0,
  "Lily",  2011,  "Chicago",     "Advisor",  0,
  "Lily",  2011,  "New York",    "board",    1,
  "Sam",   2006,  "Chicago",     "COO",      0,
  "Sam",   2007,  "Chicago",     "CFO",      0,
  "Sam",   2007,  "Chicago",     "CEO",      1,
  "Sam",   2010,  "New York",    "Advisor",  0
)

df %>%
  # does any job in this person-year-office have rank 0?
  group_by(person, year, Office) %>%
  summarise(any_rank_0 = any(rank == 0)) %>%
  ungroup() %>%
  # add every office to every person-year that occurs (new rows get NA)
  complete(nesting(person, year), Office) %>%
  arrange(person, Office, year) %>%
  group_by(person, Office) %>%
  mutate(
    case_1 = any_rank_0 & !lead(any_rank_0), # current 0, next not 0
    case_2 = any_rank_0 & !lag(any_rank_0)   # current 0, previous not 0
  ) %>%
  group_by(person) %>%
  summarise(result = replace_na(any(case_1) | any(case_2), FALSE))
#> # A tibble: 4 x 2
#> person result
#> <chr> <lgl>
#> 1 Harry TRUE
#> 2 Lily FALSE
#> 3 Peter TRUE
#> 4 Sam FALSE
Created on 2021-05-20 by the reprex package (v1.0.0)
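
To see what the complete(nesting(person, year), Office) step contributes, here is a minimal sketch on a made-up three-row table (the toy data frame and its values are assumptions for illustration, not the question's data). Every office is added to every person-year that actually occurs, and the added rows get NA, which is why the final summarise() wraps its result in replace_na().

```r
library(tidyr)

# Hypothetical toy input: person A has offices X and Y in 2001, only X in 2002.
toy <- data.frame(
  person     = c("A", "A", "A"),
  year       = c(2001, 2001, 2002),
  Office     = c("X", "Y", "X"),
  any_rank_0 = c(TRUE, FALSE, FALSE)
)

# nesting() keeps only the person-year pairs present in the data;
# crossing them with Office adds the missing (A, 2002, Y) row with NA.
filled <- complete(toy, nesting(person, year), Office)
filled
```

Because the added row carries NA rather than FALSE, a comparison such as any_rank_0 & !lead(any_rank_0) returns NA when the office is absent in the neighbouring year, so an absent office is never counted as "present with rank above 0".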

Related

R Dataframe Detecting Hidden Repeated Patterns by Group

I have a dataframe that looks like this:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know, at the person level, who has a location with rank = 1 in some year that reappears in the next available year with rank != 1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1

Here's an approach with dplyr, probably could be more concise.
library(dplyr)
df1 %>%
  # define year_num as a count of unique years [assumes sorted already]
  group_by(person) %>%
  mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
  # check for successive years with different ranks
  group_by(person, location) %>%
  mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 &
           rank != lag(rank)) %>%
  group_by(person) %>%
  summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1
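
The year_num trick above can be checked in isolation. This is a small sketch using Harry's years from the question; the vector name years is only for illustration. cumsum() over "this year differs from the previous one" numbers the available years 1, 2, 3, ... so that "next available year" becomes "year_num + 1". dplyr's dense_rank() is shown as an alternative that gives the same index without assuming the rows are already sorted.

```r
library(dplyr)

# Harry's years, already sorted as the answer assumes
years <- c(2002, 2006, 2006, 2006)

# TRUE whenever the year changes; the running total indexes the distinct years
year_num <- cumsum(years != lag(years, default = 0))

# Same index computed from the values themselves, independent of row order
year_rank <- dense_rank(years)

year_num
year_rank
```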

R - clean up data based on preceding and following values

I have a table which is later divided into multiple intervals based on multiple conditions. In some rare cases, one or more rows do not fall into the defined interval, so I'd like to perform some extra clean-up on the data.
For each group (name, location), if the row value in stop == 0, I need to count how many of those rows are in the interval. If that count is less than 3, I need to check how many contiguous rows are marked as stop == 1 above and below the interval of zeros. If the rows directly above and below have stop == 1, then I need to change the zeros in the interval to 1.
I hope the example below makes it clearer:
df <- read.table(text="name location stop
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John London 0
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = FALSE)

You could iterate over the rows, but it seems that all you want to do is replace all instances of 101 with 111 and 1001 with 1111 in stop. You can do this by collapsing the stop column into a single string and then making the substitutions with gsub():
stopString = paste0(df$stop, collapse = "")
stopString = gsub("101","111",stopString)
stopString = gsub("1001","1111",stopString)
df$stop = as.numeric(unlist(strsplit(stopString,"")))
> df
name location stop
1 John London 1
2 John London 1
3 John London 1
4 John London 1
5 John London 1
6 John London 1
7 John London 1
8 John London 1
9 John London 1
10 John London 1
11 John London 1
12 John London 1
13 John London 1
14 John London 1
15 John London 1
16 John London 0
17 John New_York 0
18 John New_York 0
19 John New_York 0
20 John New_York 1
21 John New_York 0
Edit: grouping by name and location:
df <- read.table(text="name location stop
John London 1
John London 0
John London 1
John New_York 0
John New_York 1
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = TRUE)
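library(dplyr) # provides %>%
f <- function(x)
{
  # collapse the 0/1 vector into one string, fill the gaps, split it back
  stopString = paste0(x, collapse = "")
  stopString = gsub("101","111",stopString)
  stopString = gsub("1001","1111",stopString)
  as.numeric(unlist(strsplit(stopString,"")))
}
# with dplyr >= 1.1.0, reframe() is the recommended verb when a summary
# returns several rows per group
> df %>% dplyr::group_by(name, location) %>%
    dplyr::summarise(stop=stop, s=f(stop))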
# A tibble: 10 x 4
# Groups: name, location [2]
name location stop s
<fct> <fct> <int> <dbl>
1 John London 1 1
2 John London 0 1
3 John London 1 1
4 John New_York 0 0
5 John New_York 1 1
6 John New_York 0 0
7 John New_York 0 0
8 John New_York 0 0
9 John New_York 1 1
10 John New_York 0 0
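
One caveat with the gsub() approach is that matches cannot overlap, so back-to-back gaps such as 10101 are only partly filled in a single pass. As an alternative sketch (not the answerer's method), the same rule can be written with rle(). It assumes stop contains only 0 and 1; the function name fill_short_gaps and the max_gap parameter are hypothetical.

```r
# Fill runs of zeros that are at most max_gap long and have ones on both sides.
fill_short_gaps <- function(x, max_gap = 2) {
  r <- rle(x)
  n <- length(r$values)
  if (n >= 3) {
    # interior runs always have a neighbouring run on each side, and with
    # 0/1 data the neighbours of a zero run are necessarily runs of ones
    idx <- 2:(n - 1)
    short_zero <- r$values[idx] == 0 & r$lengths[idx] <= max_gap
    r$values[idx[short_zero]] <- 1
  }
  inverse.rle(r)
}

# Two adjacent one-row gaps: gsub() leaves the second gap untouched
gsub_result <- gsub("101", "111", "10101")
rle_result  <- fill_short_gaps(c(1, 0, 1, 0, 1))

# Leading/trailing zeros and gaps of three or more rows are left alone
edge_result <- fill_short_gaps(c(0, 0, 0, 1, 0))
long_result <- fill_short_gaps(c(1, 0, 0, 0, 1))
```

To apply it per group, the function can replace f in the grouped call above.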

Summary output to independent dataset

I'm working with a Twitter dataset that I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
alabama arizona arkansas california colorado connecticut
3 6 2 104 5 1
delaware district of columbia florida georgia idaho illinois
1 0 17 7 0 12
indiana iowa kansas kentucky louisiana maine
4 1 2 3 2 1
maryland massachusetts michigan minnesota mississippi missouri
1 2 9 6 0 2
montana nebraska nevada new hampshire new jersey new mexico
0 3 5 1 4 7
new york north carolina north dakota ohio oklahoma oregon
25 8 1 3 2 4
pennsylvania rhode island south carolina south dakota tennessee texas
22 0 2 1 3 35
utah vermont virginia washington west virginia wisconsin
2 1 3 5 0 2
wyoming NA's
1 17669
Can you please advise how I can create an independent dataset from the output above, with two columns (state and n)?
Thanks

We can wrap with stack to create a two column data.frame from the OP's code
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
  unnest(c(value))

Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
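
The table() and count() options differ in how they treat empty factor levels and missing values, which matters here because the summary shows several states with 0 tweets and many NA's. The following is a small sketch on made-up data; rt_toy and its values are assumptions standing in for rt1.

```r
library(dplyr)

# Hypothetical stand-in for rt1: one unused level ("idaho") and one NA
rt_toy <- data.frame(
  state = factor(c("texas", "ohio", "texas", NA),
                 levels = c("idaho", "ohio", "texas"))
)

# table() keeps the empty level and, by default, drops the NA
tab <- as.data.frame(table(rt_toy$state), stringsAsFactors = FALSE)
names(tab) <- c("state", "n")

# count() drops the empty level by default and keeps the NA as its own row
cnt <- count(rt_toy, state)

tab
cnt
```

Passing useNA = "ifany" to table(), or .drop = FALSE to count(), brings the two results closer together if both zero counts and NA are wanted.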

How to manually enter a cell in a dataframe? [duplicate]

This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row containing the NA changes from day to day, so I would like to fill it in based on the county. Is there a way to say something like df[df$county == "New York C",] <- df$FIPS = "55555"? I don't want to insert based on the column or row number because they change.

This will put 55555 into the NA cells within column FIPS where county is New York C:
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000

You could use & (and) to substitute the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state=="New York"]<-55555

If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
  mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
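
As a quick check of the conditional replacement, here is a self-contained sketch that rebuilds the seven-row sample from the first answer and confirms that only the NA row for New York C changes. It assumes FIPS is stored as a number; if it is a character column in the real data, assign "55555" instead. The helper name target is only for illustration.

```r
# Sample data as shown in the first answer (FIPS assumed numeric)
df <- data.frame(
  county = c("Abbeville", "Acadia", "Accomack", "New York C",
             "Ada", "Adair", "New York C"),
  state  = c("South Carolina", "Louisiana", "Virginia", "New York",
             "Idaho", "Iowa", "New York"),
  cases  = c(4, 9, 3, 2, 113, 1, 1),
  deaths = c(0, 1, 0, 0, 2, 0, 0),
  FIPS   = c(45001, 22001, 51001, NA, 16001, 19001, 18000),
  stringsAsFactors = FALSE
)

# Logical index: rows that are both missing a FIPS and belong to the county
target <- is.na(df$FIPS) & df$county == "New York C"
df$FIPS[target] <- 55555

df
```

The second New York C row keeps its existing value of 18000 because the is.na() condition excludes it.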

How to join/ merge two tables using character values?

I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether the row from table 1 was present in the 2nd table.
First table is a panel data set of some attributes of NBA players during a season:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
7 Larry Bird 1992
8 Larry Bird 1992
The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
My desired result looks like:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
I tried to use a left join, but I am not sure whether that makes sense:
test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")

Here's a simple solution using a data.table binary join, which allows you to update a column by reference while joining:
library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
# firstname lastname year allstars
# 1: Larry Bird 1992 NA
# 2: Larry Bird 1992 NA
# 3: Magic Johnson 1991 1
# 4: Magic Johnson 1992 1
# 5: Magic Johnson 1993 1
# 6: Michael Jordan 1991 1
# 7: Michael Jordan 1992 1
# 8: Michael Jordan 1993 1
Or using dplyr
library(dplyr)
ALLSTARS %>%
  mutate(allstars = 1L) %>%
  right_join(., season)
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 NA
# 8 Larry Bird 1992 NA

In base R:
ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L
newdf
Or one I like for a different approach:
season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
#
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 0
# 8 Larry Bird 1992 0

It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0
Result:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
Though I personally would use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case as by default the function will try to join on all fields with common names, which is what you want here.
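
Putting the dplyr suggestion together with the NA-to-0 step, a minimal sketch could look like the following. The data frames are rebuilt from the question; the result name and the use of distinct() (a guard against duplicated rows in ALLSTARS multiplying rows of season) are assumptions added for illustration.

```r
library(dplyr)

season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic",
                "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson",
                "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

# Flag the all-star rows, join them onto season, then turn NA into 0
flags <- ALLSTARS %>%
  distinct() %>%
  mutate(allstars = 1L)

result <- season %>%
  left_join(flags, by = c("firstname", "lastname", "year")) %>%
  mutate(allstars = coalesce(allstars, 0L))

result
```

left_join() keeps the rows of season in their original order, so the result lines up with the desired output in the question.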
