This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row that contains the NA changes from day to day, so I would like to fill it in based on the county. Is there a way to say something like df[df$county == "New York C",] <- df$FIPS = "55555"? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells of column FIPS where county is New York C:
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
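If the same fill has to be repeated every day, the base R assignment above can be wrapped in a small helper. This is only a sketch: fill_fips is a hypothetical name, not part of any package, and the data frame is rebuilt from the sample shown in the answers above.

```r
# Sample data copied from the answer above (row 7 already has a FIPS value)
df <- data.frame(
  county = c("Abbeville", "Acadia", "Accomack", "New York C", "Ada", "Adair", "New York C"),
  state  = c("South Carolina", "Louisiana", "Virginia", "New York", "Idaho", "Iowa", "New York"),
  cases  = c(4, 9, 3, 2, 113, 1, 1),
  deaths = c(0, 1, 0, 0, 2, 0, 0),
  FIPS   = c(45001, 22001, 51001, NA, 16001, 19001, 18000)
)

# Hypothetical helper: fill missing FIPS values for one county only
fill_fips <- function(data, county_name, value) {
  # which() drops any NA produced by the comparison, so only real matches are indexed
  hit <- which(is.na(data$FIPS) & data$county == county_name)
  data$FIPS[hit] <- value
  data
}

df <- fill_fips(df, "New York C", 55555)
```

Only the row whose FIPS was NA is changed; the other New York C row keeps its existing value.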
I have a dataframe that looks like this:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know, at the person level, who has a location with rank = 1 in a given year that reappears in the next available year with rank != 1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1
Here's an approach with dplyr; it could probably be more concise.
library(dplyr)
df1 %>%
# define year_number as a count of unique years [assumes sorted already]
group_by(person) %>%
mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
# check for successive years with different ranks
group_by(person, location) %>%
mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
group_by(person) %>%
summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1
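For comparison, here is a base R sketch that follows the wording of the question literally: the location must have rank 1 in one year and reappear with a different rank in that person's next available year. It is not the dplyr answer rewritten; flag_person is a hypothetical helper name, and df1 is rebuilt from the sample data in the question.

```r
df1 <- data.frame(
  person = c(rep("Harry", 4), rep("Peter", 2), rep("Lily", 5), rep("Sam", 7)),
  year = c(2002, 2006, 2006, 2006,
           2001, 2002,
           2005, 2007, 2008, 2011, 2011,
           2005, 2007, 2008, 2008, 2008, 2011, 2011),
  location = c("Los Angeles", "Boston", "Los Angeles", "Chicago",
               "New York", "New York",
               "Springfield", "New York", "Boston", "Chicago", "New York",
               "Springfield", "New York", "Boston", "Springfield", "New York",
               "Chicago", "Springfield"),
  rank = c(1, 1, 2, 3,
           1, 1,
           1, 1, 1, 1, 2,
           1, 1, 1, 2, 3, 1, 2)
)

# Hypothetical helper: returns 1 if any rank-1 location of one year
# shows up in the person's next available year with a rank other than 1
flag_person <- function(d) {
  yrs <- sort(unique(d$year))
  for (i in seq_len(length(yrs) - 1)) {
    top <- d$location[d$year == yrs[i] & d$rank == 1]
    nxt <- d[d$year == yrs[i + 1], ]
    if (any(nxt$location %in% top & nxt$rank != 1)) return(1L)
  }
  0L
}

# One value per person; split() orders the names alphabetically
res <- vapply(split(df1, df1$person), flag_person, integer(1))
```

On the sample data this flags Harry and Sam, matching the expected output in the question.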
I'm working with a Twitter dataset I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
alabama       arizona               arkansas        california     colorado       connecticut
3             6                     2               104            5              1
delaware      district of columbia  florida         georgia        idaho          illinois
1             0                     17              7              0              12
indiana       iowa                  kansas          kentucky       louisiana      maine
4             1                     2               3              2              1
maryland      massachusetts         michigan        minnesota      mississippi    missouri
1             2                     9               6              0              2
montana       nebraska              nevada          new hampshire  new jersey     new mexico
0             3                     5               1              4              7
new york      north carolina        north dakota    ohio           oklahoma       oregon
25            8                     1               3              2              4
pennsylvania  rhode island          south carolina  south dakota   tennessee      texas
22            0                     2               1              3              35
utah          vermont               virginia        washington     west virginia  wisconsin
2             1                     3               5              0              2
wyoming       NA's
1             17669
Can you please advise on how I can create an independent dataset from the output above, so that I have 2 columns (state and n)?
Thanks
We can wrap with stack to create a two column data.frame from the OP's code
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
unnest(c(value))
Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
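If the result should come out with the column names state and n directly, and with the missing values counted as their own row, table() can be combined with as.data.frame(). A small sketch, using a made-up factor named state in place of rt1$state:

```r
# Made-up stand-in for rt1$state: a factor with an unused level and some NAs
state <- factor(c("texas", "ohio", NA, "texas", NA),
                levels = c("ohio", "texas", "utah"))

# useNA = "ifany" adds a row for the NAs; responseName renames the count column
counts <- as.data.frame(table(state, useNA = "ifany"),
                        responseName = "n",
                        stringsAsFactors = FALSE)
```

The first column is named after the variable passed to table(), so with rt1$state it would need renaming (for example with names(counts) <- c("state", "n")).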
I have a data frame with 627 observations and 16 variables. I am looking at one column named "ZoneDivision", a factor with the levels North Eastern, Eastern and South Eastern.
So, I want to compare adjacent row values and create a new column that is 1 if two adjacent rows have the same zone and 0 if they differ.
I referred to the following questions to find a way out:
Matching two Columns in R
compare row values over multiple rows (R)
library(dplyr)
a <- c(rep("Eastern",3),rep ("North Eastern", 6),rep("South Eastern", 3))
a=data.frame(a)
colnames(a)="ZoneDivision"
#comparing the zones
library(plyr)
ddply(a, .(ZoneDivision), summarize, ZoneMatching = Position(isTRUE, ZoneDivision))
Expected Result
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
Actual Result
ZoneDivision ZoneMatching
1 Eastern NA
2 North Eastern NA
3 South Eastern NA
How should I proceed? Please help!!
Using base R, we can do
as.numeric(c(NA, a$ZoneDivision[-1] == a$ZoneDivision[-nrow(a)]))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1
The data.table way:
library(data.table)
a <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))
dt <- as.data.table(a)
# compare each value of column a with the previous one, produced by shift()
dt[, ZoneMatching := as.numeric(a == shift(a, 1))]
This adds a new ZoneMatching column holding the numeric value of the logical comparison between the a column and its lagged values, which are generated by the shift() function.
You can use lag to get that:
library(dplyr)
a %>%
mutate(ZoneMatching = as.numeric((ZoneDivision == lag(ZoneDivision, 1))))
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
We can use base R
with(a, c(NA, +(head(ZoneDivision, -1) == tail(ZoneDivision, -1))))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum all values for the same year into one value per year, so that I end up with one row per country and year and one summed value in every column. I want to do this across the whole data set, for all years and countries. Any help is much appreciated. Thank you!
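One way to get one row per country and year is aggregate() from base R. The sketch below uses the first nine rows of the sample above and assumes that only protest and duration should be summed (id and ccode are identifiers, so summing them would not be meaningful); the data frame name protests is made up.

```r
# First nine rows of the sample, reduced to the columns used here
protests <- data.frame(
  country  = "Canada",
  year     = c(rep(1990, 6), 1991, 1991, 1992),
  protest  = 1,
  duration = c(1, 1, 1, 57, 2, 1, 8, 5, 2)
)

# Sum protest and duration within each country/year combination
yearly <- aggregate(cbind(protest, duration) ~ country + year,
                    data = protests, FUN = sum)
```

Because country is part of the grouping formula, the same call works unchanged when the data contain more than one country.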
Say that I have two dataframes. One lists the names of soccer players, the teams that they have played for, and the number of goals that they have scored on each team. The other contains the soccer players' ages and their names. How do I add a "names_age" column to the goals dataframe that holds the age of the players in the first column, "names", not of those in "teammates_names"? How do I add an additional column that holds the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
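Spelled out for both columns, the match approach could look like the sketch below. GOALS_DF is cut down to three rows and two columns here, and the column name teammates_age is just a suggestion, since the question only names names_age.

```r
AGE_DF <- data.frame(
  names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
  age   = 20:25
)

# Reduced version of GOALS_DF: one row per player pairing
GOALS_DF <- data.frame(
  names           = c("Sam", "Jon", "Adam"),
  teammates_names = c("Jason", "Jones", "Jermaine")
)

# match() returns, for each name, its row position in AGE_DF
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]
```

Names that do not appear in AGE_DF get NA, and the row order of GOALS_DF is left untouched.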
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so that you can use dcast to get back to the wide format and retain the row ordering if it's important.