String comparison in 2 adjacent rows of a data frame in R

I have a data frame with 627 observations and 16 variables. I am looking at one column named "ZoneDivision", a factor with the levels North Eastern, Eastern and South Eastern.
I want to compare adjacent rows and create a new column that is 1 if a row has the same zone as the row before it, and 0 if the two rows differ.
I referred to the following links to find a way out:
[here] Matching two Columns in R
[here] compare row values over multiple rows (R)
library(dplyr)
a <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))
a <- data.frame(a)
colnames(a) <- "ZoneDivision"
# comparing the zones
library(plyr)
ddply(a, .(ZoneDivision), summarize, ZoneMatching = Position(isTRUE, ZoneDivision))
Expected Result
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1
Actual Result
ZoneDivision ZoneMatching
1 Eastern NA
2 North Eastern NA
3 South Eastern NA
How should I proceed? Please help!!

Using base R, we can do
as.numeric(c(NA, a$ZoneDivision[-1] == a$ZoneDivision[-nrow(a)]))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1

The data.table way:
library(data.table)
a <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))
dt <- as.data.table(a)
dt[, ZoneMatching := as.numeric(a == shift(a, 1))]
This adds a new ZoneMatching column holding the numeric value of the logical comparison between column a and its lagged values, which the shift() function generates.
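To make the lag idea concrete, here is a minimal base R sketch of the same comparison without any package. The variable names zone, lagged and zone_matching are made up for illustration; the lag is built by hand by dropping the last element and padding the front with NA.

```r
# Zone labels from the question.
zone <- c(rep("Eastern", 3), rep("North Eastern", 6), rep("South Eastern", 3))

# Lag by one position: drop the last element and pad the front with NA.
lagged <- c(NA, head(zone, -1))

# Compare each row with the previous one. The first row has nothing
# before it, so the comparison (and the result) is NA there.
zone_matching <- as.numeric(zone == lagged)
print(zone_matching)
```

The first element stays NA, which matches the expected result in the question.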

You can use lag to get that:
library(dplyr)
a %>%
mutate(ZoneMatching = as.numeric((ZoneDivision == lag(ZoneDivision, 1))))
ZoneDivision ZoneMatching
1 Eastern NA
2 Eastern 1
3 Eastern 1
4 North Eastern 0
5 North Eastern 1
6 North Eastern 1
7 North Eastern 1
8 North Eastern 1
9 North Eastern 1
10 South Eastern 0
11 South Eastern 1
12 South Eastern 1

We can use base R
with(a, c(NA, +(head(ZoneDivision, -1) == tail(ZoneDivision, -1))))
#[1] NA 1 1 0 1 1 1 1 1 0 1 1

Related

How to manually enter a cell in a dataframe? [duplicate]

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row containing the NA changes from day to day, so I would like to fill it in based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where county is New York C
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
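The printed data above is not directly pasteable, so here is a sketch that rebuilds it and applies the same assignment. The data.frame() call is a reconstruction from the printout (column types are assumed to be character and numeric), and target is a name made up for the logical index.

```r
# Rebuild the seven-row example; types are an assumption based on the printout.
df <- data.frame(
  county = c("Abbeville", "Acadia", "Accomack", "New York C", "Ada", "Adair", "New York C"),
  state  = c("South Carolina", "Louisiana", "Virginia", "New York", "Idaho", "Iowa", "New York"),
  cases  = c(4, 9, 3, 2, 113, 1, 1),
  deaths = c(0, 1, 0, 0, 2, 0, 0),
  FIPS   = c(45001, 22001, 51001, NA, 16001, 19001, 18000),
  stringsAsFactors = FALSE
)

# TRUE only where FIPS is missing AND the county matches.
target <- is.na(df$FIPS) & df$county == "New York C"

# Assign through the logical index; all other rows are left alone.
df$FIPS[target] <- 55555
print(df)
```

Row 7 is also "New York C" but already has a FIPS value, so the is.na() condition keeps it unchanged.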
You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))

Sum up values with same ID from different column in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum all values for the same year so that every column holds one value per year. I want to do this across the whole data set, for all years and countries. Any help is much appreciated. Thank you!
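One possible way to do this, sketched with base R's aggregate() on a small made-up subset of the data. Which columns are meaningful to sum is an assumption here (protest and duration); identifiers such as id or ccode are left out because adding them up would not mean anything.

```r
# Small illustrative subset; values taken from the first rows of the question.
protests <- data.frame(
  country  = "Canada",
  year     = c(1990, 1990, 1990, 1991, 1991, 1992),
  protest  = 1,
  duration = c(1, 1, 57, 8, 5, 2)
)

# One row per country and year, with the chosen columns summed.
yearly <- aggregate(cbind(protest, duration) ~ country + year,
                    data = protests, FUN = sum)
print(yearly)
```

Grouping on both country and year keeps the same year in different countries apart.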

How to find observations whose dummy variable changes from 1 to 0 (and not viceversa) in a df in r

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know whether there are individuals who invested in a complementary pension form in year t but no longer held it in year t+2 (the survey is conducted every two years). In this way I want to know how many people had a complementary pension form but released it before retiring or gave it up (for example for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
This gives me the individuals whose pens variable changed over time (the command checks whether a variable is constant over time). So I find individuals whose pens variable changed from 0 (no complementary pension) in year t to 1 in year t+2 and vice versa, but I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always exactly 2 rows we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
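The diff() test only detects a drop from 1 to 0 if the rows of each id are in chronological order. The sketch below sorts first and then lists the affected ids; it uses tapply() in place of ave() to get one value per id, and the names dropped and ids_dropped are made up for illustration.

```r
# Reduced version of the example data (other columns omitted).
df <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008),
  id   = c(1, 2, 1, 2, 3, 3),
  pens = c(0, 1, 1, 0, 1, 1)
)

# Make sure each id's rows are in time order before differencing.
df <- df[order(df$id, df$year), ]

# diff() is negative exactly where pens goes from 1 to 0.
dropped <- tapply(df$pens, df$id, function(p) any(diff(p) < 0))

# ids with at least one 1 -> 0 transition, and how many there are.
ids_dropped <- names(dropped)[dropped]
print(ids_dropped)
print(sum(dropped))
```

Only id 2 is reported here; id 1 goes from 0 to 1, which gives a positive difference and is ignored.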

Subsetting two factors in R

I have a huge dataset, and I have a column called season. There are 4 seasons i.e. Winter, Spring, Summer and Autumn.
Region Year Male Female Area DATE Day Month Season
WEST 1996 0 1 4 06-04-96 Saturday April Spring
EAST 1996 0 1 16 29-06-96 Saturday June Summer
WEST 1996 0 1 4 19-10-96 Saturday October Winter
WEST 1996 0 1 4 20-10-96 Sunday October Winter
EAST 1996 0 1 16 01-11-96 Friday November Winter
EAST 1996 0 1 16 11-11-96 Monday November Winter
WEST 1996 0 1 4 19-11-96 Tuesday November Winter
WEST 1996 0 1 4 28-11-96 Thursday November Winter
WEST 1996 0 1 4 10-12-96 Tuesday December Winter
WEST 1997 0 1 4 17-01-97 Friday January Winter
WEST 1997 0 1 4 28-03-97 Friday March Spring
I am trying to create a subset that shows only the entries where the season is Winter or Autumn.
I created a subset first of the portion I want.
secondphase<-subset(eb1, Area>16)
Now, from this subset, I want the rows where Season is Winter or Autumn.
I tried this code:
th2<-subset(secondphase, Season== "Winter")
th3<-subset(secondphase, Season=="Autumn")
Is there a way to merge these two subsets, or to create a single subset with both conditions, i.e. Area > 16 and Season either Winter or Autumn?
Thanks for the help.
You could also use the dplyr package with the filter function
filter(secondphase, grepl("Winter|Autumn", Season))
Method 1
my_subset <- eb1[eb1$Season %in% c("Winter", "Autumn") & eb1$Area > 16,]
Method 2
th2 <- subset(secondphase, Season== "Winter")
th3 <- subset(secondphase, Season=="Autumn")
final <- rbind(th2, th3)
Method 3
final <-subset(eb1[eb1$Area > 16,], Season== "Winter" | Season=="Autumn")
With a data.table approach,
library("data.table")
DT<-data.table(eb1)
subsetDT<-subset(DT, Season %in% c("Autumn","Winter") & Area > 16)
does the job.
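As a self-contained illustration of the %in% approach used above, here is a sketch on a small made-up data frame (the Area and Season values are invented so that some rows actually pass Area > 16). It combines both conditions in one subset() call.

```r
# Hypothetical toy data; only the two relevant columns are included.
eb1 <- data.frame(
  Area   = c(4, 16, 20, 25, 30, 18),
  Season = c("Spring", "Winter", "Winter", "Autumn", "Summer", "Autumn"),
  stringsAsFactors = FALSE
)

# %in% tests each Season against the whole set of wanted values.
# (Season == c("Winter", "Autumn") would recycle the vector and
# compare element by element, which is not what is wanted here.)
wanted <- subset(eb1, Area > 16 & Season %in% c("Winter", "Autumn"))
print(wanted)
```

This avoids building two subsets and binding them, and keeps the rows in their original order.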

How to create timeseries by grouping entries in R?

I want to create a time series from 01/01/2004 until 31/12/2010 of daily mortality data in R. The raw data that I have now (.csv file), has as columns day - month - year and every row is a death case. So if the mortality on a certain day is for example equal to four, there are four rows with that date. If there is no death case reported on a specific day, that day is omitted in the dataset.
What I need is a time-series with 2557 rows (from 01/01/2004 until 31/12/2010) wherein the total number of death cases per day is listed. If there is no death case on a certain day, I still need that day to be in the list with a "0" assigned to it.
Does anyone know how to do this?
Thanks,
Gosia
Example of the raw data:
day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004
What I need:
day month year deaths
1 1 2004 1
2 1 2004 0
3 1 2004 3
4 1 2004 0
5 1 2004 0
6 1 2004 1
df <- read.table(text="day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004",header=TRUE)
#transform to dates
dates <- as.Date(with(df,paste(year,month,day,sep="-")))
#contingency table
tab <- as.data.frame(table(dates))
names(tab)[2] <- "deaths"
tab$dates <- as.Date(tab$dates)
#sequence of dates
res <- data.frame(dates=seq(from=min(dates),to=max(dates),by="1 day"))
#merge
res <- merge(res,tab,by="dates",all.x=TRUE)
res[is.na(res$deaths),"deaths"] <- 0
res
# dates deaths
#1 2004-01-01 1
#2 2004-01-02 0
#3 2004-01-03 3
#4 2004-01-04 0
#5 2004-01-05 0
#6 2004-01-06 1
#7 2004-01-07 1
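The answer above spans only the first to the last observed date. To cover the full period requested in the question (01/01/2004 to 31/12/2010) even when the first or last days have no deaths, one option is to count against a fixed calendar. This is a sketch; all_days, counts and res are illustrative names, and the factor-levels trick is a swapped-in alternative to the merge.

```r
# Example raw data: one row per death case.
df <- data.frame(day = c(1, 3, 3, 3, 6, 7), month = 1, year = 2004)
dates <- as.Date(paste(df$year, df$month, df$day, sep = "-"))

# Every calendar day in the requested period.
all_days <- seq(as.Date("2004-01-01"), as.Date("2010-12-31"), by = "day")

# Tabulate against the full calendar: days without a case get a count of 0.
counts <- table(factor(as.character(dates), levels = as.character(all_days)))

res <- data.frame(date = all_days, deaths = as.integer(counts))
print(nrow(res))
print(head(res, 7))
```

Because the calendar is fixed, the result always has one row per day of the period, regardless of which days appear in the raw data.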
