How to manually enter a cell in a dataframe? [duplicate] - r

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of rows long, and the row containing the NA changes from day to day, so I would like to assign the value based on the county. Is there a way to say something like df[df$county == "New York C",] <- df$FIPS = "55555"? I don't want to index by column or row number because those change.

This will put 55555 into the NA cells within column FIPS where county is "New York C":
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000

You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555

If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
  mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
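If you prefer to avoid ifelse(), base replace() inside mutate() is another option; this is a minimal sketch with the same condition, not from the original answers:
library(dplyr)
df <- df %>%
  # replace() swaps in 55555 only where FIPS is NA and county matches
  mutate(FIPS = replace(FIPS, is.na(FIPS) & county == "New York C", 55555))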

Related

R Dataframe Detecting Hidden Repeated Patterns by Group

I have a dataframe looks like below:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know at person level, who has a location with rank=1 in a certain year and this location reappears in the next available year but rank!=1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1
Here's an approach with dplyr; it could probably be more concise.
library(dplyr)
df1 %>%
  # define year_num as a count of unique years [assumes rows are already sorted by year]
  group_by(person) %>%
  mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
  # check for successive years with different ranks within each person/location
  group_by(person, location) %>%
  mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
  group_by(person) %>%
  summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1
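If the same location can switch more than once for a person and you want a strict 0/1 flag rather than a count, wrapping the indicator in any() is one option; here is a sketch of the same pipeline with that one change:
library(dplyr)
df1 %>%
  group_by(person) %>%
  mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
  group_by(person, location) %>%
  mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
  group_by(person) %>%
  # any() collapses multiple switches per person into a single TRUE/FALSE
  summarize(`yes/no` = as.integer(any(next_yr_switch)))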

Summary output to independent dataset

I'm working with a Twitter dataset I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
alabama arizona arkansas california colorado connecticut
3 6 2 104 5 1
delaware district of columbia florida georgia idaho illinois
1 0 17 7 0 12
indiana iowa kansas kentucky louisiana maine
4 1 2 3 2 1
maryland massachusetts michigan minnesota mississippi missouri
1 2 9 6 0 2
montana nebraska nevada new hampshire new jersey new mexico
0 3 5 1 4 7
new york north carolina north dakota ohio oklahoma oregon
25 8 1 3 2 4
pennsylvania rhode island south carolina south dakota tennessee texas
22 0 2 1 3 35
utah vermont virginia washington west virginia wisconsin
2 1 3 5 0 2
wyoming NA's
1 17669
Can you please advise on how I can create an independent dataset from the output above, so that I have two columns (state and n)?
Thanks
We can wrap the OP's code with stack to create a two-column data.frame:
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
  unnest(c(value))
Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
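A reproducible sketch of the count() route using iris (Species standing in for state); note that count() keeps NA as its own row, so the NA coordinates can be dropped first if you don't want them counted:
library(dplyr)
# one row per level plus a count column n
count(iris, Species)
# on the real data, dropping the NA row first:
# rt1 %>% filter(!is.na(state)) %>% count(state)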

String comparison in 2 adjacent rows of a data frame in R

I have a data frame with 627 observations and 16 variables. I am considering one column named "ZoneDivision", which has the factor levels North Eastern, Eastern and South Eastern.
I want to compare adjacent row values and create a new column that has 1 if two adjacent rows have the same zone, and 0 if the adjacent rows are different.
I referred to the following links to find a way out:
Matching two Columns in R
compare row values over multiple rows (R)
library(dplyr)
a <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))
a <- data.frame(a)
colnames(a) <- "ZoneDivision"
# comparing the zones
library(plyr)
ddply(a, .(ZoneDivision), summarize, ZoneMatching = Position(isTRUE, ZoneDivision))
Expected Result
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
Actual Result
ZoneDivision ZoneMatching
1 Eastern NA
2 North Eastern NA
3 South Eastern NA
How should I proceed? Please help!!
Using base R, we can do
as.numeric(c(NA, a$ZoneDivision[-1] == a$ZoneDivision[-nrow(a)]))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1
The data.table way:
library(data.table)
a <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))
dt <- as.data.table(a)
dt[, 'ZoneMatching' := as.numeric(.SD[, a] == shift(.SD[, a], 1))]
This adds a new ZoneMatching column as the numeric value of the logical comparison between the a column and its lagged values, generated by the shift() function.
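A slightly more idiomatic sketch of the same idea, referring to the column directly rather than through .SD, and continuing from the dt created above (shift() lags by 1 and fills with NA by default):
dt[, ZoneMatching := as.numeric(a == shift(a))]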
You can use lag to get that:
library(dplyr)
a %>%
  mutate(ZoneMatching = as.numeric(ZoneDivision == lag(ZoneDivision, 1)))
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
We can use base R
with(a, c(NA, +(head(ZoneDivision, -1) == tail(ZoneDivision, -1))))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1

Sum up values with same ID from different column in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum up all values for the same year into one value per year, so that I receive one value per year in every column. I want to iterate this through the whole data set for all years and countries. Any help is much appreciated. Thank you!
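The question was closed as a duplicate of the linked group-sum answers, but a minimal sketch along those lines, assuming the data frame is called df and you only want to sum the numeric event columns, might be:
library(dplyr)
df %>%
  group_by(country, year) %>%
  # one row per country/year, summing the event columns
  summarise(across(c(protest, protestnumber, duration), sum), .groups = "drop")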

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe that holds the age for the players in the first column "names", not for "teammates_names"? And how do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
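For example, to build the two age columns described in the question in a vectorized way (the new column names here are just illustrative):
GOALS_DF$names_age <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]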
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the row names so you can use dcast to get back to the wide format and retain the row ordering if it's important.
