I have a large dataset which, simplified, looks something like this:
Year  Name   January  February  March  April  May   Street
2000  Bob    $100     $197      $124   $100         ABC
2000  Abe    $100     $100      $117   $123   $100  ABC
2001  Bob    $100     $100      $197   $103   $150  DEF
2001  Abe    $140     $100      $127   $526   $123  ABC
2002  Abe    $100     $100      $198   $102   $101  DEF
2002  Bob    $102     $110                          ABC
2003  Carly  $100     $100      $197                ABC
I am trying to combine this data so that each person has one line, with the goal of counting and graphing how many months they paid in a row.
I was thinking of trying to recode the data so that each person gets their own row, with a timeline of how much they paid by year and month, with column names like this, but I am having trouble figuring out how to do that.
Name,
2000 January, 2000 February, 2000 March, 2000 April, 2000 May,
2001 January, 2001 February, 2001 March, 2001 April, 2001 May,
2002 January, 2002 February, 2002 March, 2002 April, 2002 May,
Street
Is there a way to condense the variables in this way?
Thank you so much!
Using pivot_wider from {tidyr} will achieve this. Calling your dataframe yeardata, you can do the following:
library(tidyr)
library(dplyr)

selectmonths <- c("January", "February", "March", "April", "May")
result <- yeardata %>%
  pivot_wider(names_from = "Year", values_from = all_of(selectmonths))
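By default the new columns will be named in the January_2000 style. If you want the 2000 January style names from your example, the names_glue argument of pivot_wider (available in recent versions of tidyr) should do it; a sketch:

result <- yeardata %>%
  pivot_wider(names_from = "Year",
              values_from = all_of(selectmonths),
              names_glue = "{Year} {.value}")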
I would like to figure out who was the most recent previous owner at a location within the last two years before the current owner. The locations are called reflo (reference location). Note that there is not always an exact match for reflo.x and reflo within two years, so a solution that allows me to add additional conditions, such as finding the next closest reflo, would be extra helpful.
The conditions:
the previous owner has to have lived at the same location (lifetime_census$reflo == owners$reflo.x[i]) within two years of the current owner's census year (lifetime_census$census_year no more than two years before owners$spr_census)
if none, then assign NA
Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:
id previous_id reflo census_year
16161 5587 -310 2001
17723 5587 -310 2002
19345 5879 -310 2003
16848 5101 Q1 2001
17836 6501 Q1 2002
19439 6501 Q1 2003
21815 6057 Q1 2004
I then have an owners dataset (here is a sample):
squirrel_id spr_census reflo.x
6391 2005 Q1
6130 2005 -310
6288 2005 A12
To illustrate what I am trying to achieve:
squirrel_id spr_census reflo.x previous_owner census_year
6391 2005 Q1 6057 2004
6130 2005 -310 5879 2003
6288 2005 A12 NA NA
What I have currently tried is this:
n <- length(owners$squirrel_id)
for (i in 1:n) {
  last_owner <- subset(lifetime_census,
                       lifetime_census$previous_id != owners$squirrel_id[i] &  # previous owner != current owner
                       lifetime_census$reflo == owners$reflo.x[i] &
                       lifetime_census$census_year <= owners$spr_census[i])    # owners can be in the current or a past year
  # Put it all together
  owners[i, "spring_owner"] <- last_owner$previous_id[i]
}
This gives me a new column with the previous owner from any past year at reflo.x, with NA where the conditions are not met. I cannot figure out how to restrict this search to the last two years.
Any ideas? (As noted above, there is not always an exact match for reflo.x and reflo within two years, so a solution that allows me to add additional conditions, such as finding the next closest reflo, would be extra helpful.)
To figure out who was the most recent previous owner at a location within the last two years before the current owner, you can first arrange by census year in descending order:
library(dplyr)

lifetime_census <- lifetime_census %>%
  group_by(reflo) %>%
  arrange(desc(census_year), .by_group = TRUE)
This puts the most recent census years first within each reflo (similar to what top_n would return):
id previous_owner reflo census_year
19345 5879 -310 2003
17723 5587 -310 2002
16161 5587 -310 2001
21815 6057 Q1 2004
19439 6501 Q1 2003
17836 6501 Q1 2002
16848 5101 Q1 2001
Then you can run the loop above:
n <- length(owners$squirrel_id)
for (i in 1:n) {
  last_owner <- subset(lifetime_census,
                       lifetime_census$previous_owner != owners$squirrel_id[i] &
                       lifetime_census$reflo == owners$reflo.x[i] &
                       lifetime_census$census_year <= owners$spr_census[i])  # owners can be in the current or a past year
  # Because of the descending sort, the first row of last_owner is the most recent match
  owners[i, "previous_owner"] <- last_owner$previous_owner[1]
  owners[i, "prev_census"] <- last_owner$census_year[1]
}
This will give you:
> head(owners)
squirrel_id spr_census reflo.x previous_owner prev_census
<chr> <chr> <chr> <chr> <dbl>
6391 2005 Q1 6057 2004
6130 2005 -310 5879 2003
6288 2005 A12 <NA> <NA>
If an individual above had a match from more than two years before the spr_census year, you can fix this on a case-by-case basis (not the most elegant solution, but it's workable) with an ifelse statement, like so:
owners <- owners %>% mutate(previous_owner = ifelse(prev_census < 2003, NA, previous_owner))
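For a vectorised alternative that also enforces the two-year window directly, here is a sketch (assuming dplyr is loaded, that each squirrel_id appears once in owners, and that the previous-owner column in lifetime_census is named previous_owner as in the tables above):

library(dplyr)

# keep only candidates at the same location, within the two-year window, that are not the current owner
matches <- owners %>%
  inner_join(lifetime_census, by = c("reflo.x" = "reflo")) %>%
  filter(census_year <= spr_census,
         census_year >= spr_census - 2,
         previous_owner != squirrel_id) %>%
  group_by(squirrel_id) %>%
  slice_max(census_year, n = 1, with_ties = FALSE) %>%  # most recent match per owner
  ungroup() %>%
  select(squirrel_id, previous_owner, prev_census = census_year)

# owners with no qualifying match get NA from the left join
result <- owners %>% left_join(matches, by = "squirrel_id")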
I have two different dataframes, each of which contains different texts by month. What I want to do is merge the texts that have the same date into one single dataframe.
Let me take an example to clarify. This is dataframe_A where the third column (Article) contains some text for each date:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article_topic_A_1
2 1 February 2000 PRESS CONFERENCE Article_topic_A_2
3 1 March 2000 PRESS CONFERENCE Article_topic_A_3
This is dataframe_B that contains different text but in the same date:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article_topic_B_1
2 1 February 2000 PRESS CONFERENCE Article_topic_B_2
3 1 March 2000 PRESS CONFERENCE Article_topic_B_3
Now, I want to combine the text of Article_topic_A_1 with the text of Article_topic_B_1, text of Article_topic_A_2 with the text of Article_topic_B_2, and so on. For the same date (e.g.: 1 January 2000), I want to combine different articles (e.g.: Article_topic_A_1 and Article_topic_B_1). Basically, the final dataframe needs to look like this:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article1
2 1 February 2000 PRESS CONFERENCE Article2
3 1 March 2000 PRESS CONFERENCE Article3
The third column will contain the merged texts that have been grouped by "date".
I tried to use merge and subset but I did not manage to do it.
Can you help me with it?
Thanks a lot!
Here's a solution using merge, with the text from the two dataframes separated by a comma.
df_a <- data.frame(
Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
Title = rep("PRESS CONFERENCE", 3),
Article = c("Article_topic_A_1", "Article_topic_A_2", "Article_topic_A_3")
)
df_b <- data.frame(
Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
Title = rep("PRESS CONFERENCE", 3),
Article = c("Article_topic_B_1", "Article_topic_B_2", "Article_topic_B_3")
)
df <- merge(df_a, df_b, by = c("Date", "Title"))
df$Article <- paste(df$Article.x, df$Article.y, sep = ", ")
df <- df[, !(names(df) %in% c("Article.x", "Article.y"))]
df
#> Date Title Article
#> 1 1 February 2000 PRESS CONFERENCE Article_topic_A_2, Article_topic_B_2
#> 2 1 January 2000 PRESS CONFERENCE Article_topic_A_1, Article_topic_B_1
#> 3 1 March 2000 PRESS CONFERENCE Article_topic_A_3, Article_topic_B_3
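Note that merge returns the rows ordered by the character Date column, which is why February sorts before January above. If you want chronological order, one option (assuming an English locale so the month names parse) is:

df <- df[order(as.Date(df$Date, format = "%d %B %Y")), ]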
Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequency of medals by Team, while excluding NA as a value? At the same time, the total for each country should be summed rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(someData, .(someData$Team, someData$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count from dplyr:
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)
I have been trying to calculate the quarter-over-quarter change in shares with no luck. I have a data.table with approximately 15 million rows.
What I need to calculate is the change in the absolute number of shares from quarter to quarter, for each Holder and the stock they own.
My data table looks like this:
stock Holder Quarter Shares
1: GOOGLE Advance Capital Management, Inc. 2015 Q3 5800
2: GOOGLE Advance Capital Management, Inc. 2015 Q4 9000
3: GOOGLE Advance Capital Management, Inc. 2016 Q1 7000
4: GOOGLE Advance Capital Management, Inc. 2016 Q2 7560
5: GOOGLE Advest, Inc. 2015 Q3 12000
6: GOOGLE Advest, Inc. 2015 Q3 13450
I'm trying to use data.table functions, using
df[, qoq := c(NA, diff(Shares)), by = "Holder,stock,Quarter"]
However, I get only NA.
I was expecting something like this:
stock Holder Quarter Shares qoq
1: GOOGLE Advance Capital Management, Inc. 2015 Q3 5800 NA
2: GOOGLE Advance Capital Management, Inc. 2015 Q4 9000 4000
3: GOOGLE Advance Capital Management, Inc. 2016 Q1 7000 -2000
4: GOOGLE Advance Capital Management, Inc. 2016 Q2 7560 560
5: GOOGLE Advest, Inc. 2015 Q3 12000 NA
6: GOOGLE Advest, Inc. 2015 Q3 13450 1450
After that, I need to calculate the variance of this result, again by Holder and stock. Is there any general function to calculate statistics by grouping several columns? I tried aggregate but it is taking forever:
aggregate(REPORTED_HOLDING~Quarter+FILER_NAME+STOCK_NAME, FUN=sum, data=df)
With dplyr, assuming df is your data.frame:
library(dplyr)

df %>%
  group_by(stock, Holder) %>%
  mutate(qoq = Shares - lag(Shares)) %>%  # assumes rows are ordered by Quarter within each group
  summarise(qvar = var(qoq, na.rm = TRUE))
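For reference, the original data.table attempt returned only NA because Quarter was included in the grouping, so each group had a single row and there was nothing to difference. A data.table sketch (assuming the table can be ordered by Quarter within each Holder/stock):

library(data.table)

# previous quarter's Shares within each Holder/stock group
df[order(Holder, stock, Quarter), qoq := Shares - shift(Shares), by = .(Holder, stock)]

# variance of the quarter-over-quarter changes, again by Holder and stock
df[, .(qvar = var(qoq, na.rm = TRUE)), by = .(Holder, stock)]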
I'm working on a project where I need to sort data based on how people vote, and I cannot find a function that removes duplicate rows when certain conditions are met. Specifically, I'm looking for a way to remove duplicate rows based on one column having duplicate values and another column meeting certain conditions.
For example in the table below I would like to remove voters who voted in three different elections. Paul needs to be removed from this data frame.
df <- data.frame(Name=c("Paul","Paul","Mary","Bill","Jane","Paul","Mary","John",
"Bill","John"),ElectionDay=c("November 2010","November 2014",
"November 2010","November 2010","November 2014","November 2006",
"November 2014","November 2010","November 2014","November 2014"))
df
# Name ElectionDay
# 1 Paul November 2010
# 2 Paul November 2014
# 3 Mary November 2010
# 4 Bill November 2010
# 5 Jane November 2014
# 6 Paul November 2006
# 7 Mary November 2014
# 8 John November 2010
# 9 Bill November 2014
# 10 John November 2014
Below is an example of the result I am looking for:
Name ElectionDay
1 Mary November 2010
2 Bill November 2010
3 Jane November 2014
4 Mary November 2014
5 John November 2010
6 Bill November 2014
7 John November 2014
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'Name', get the number of unique 'ElectionDay' values (uniqueN(ElectionDay)). If that count is less than 3, we return the Subset of Data.table (.SD).
library(data.table) # v1.9.6+
setDT(df)[, if(uniqueN(ElectionDay) < 3) .SD, by = Name]
A similar base R option would be using ave. We get the number of unique elements of 'ElectionDay' grouped by 'Name' and check whether it is less than 3 to get a logical index. The index can then be used to subset the rows of the dataset.
df[with(df, ave(as.character(ElectionDay), Name,
FUN=function(x) length(unique(x)))) < 3,]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
The names that occur in more than 2 rows are calculated as
names(which(table(df$Name) > 2))
#[1] "Paul"
So what you need is
df[!(df$Name %in% names(which(table(df$Name) > 2))), ]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
Or you can also use dplyr, counting the number of elections in which each person voted and then removing the rows for which the count is 3:
library(dplyr)

df %>%
  group_by(Name) %>%
  mutate(NumberElections = length(unique(ElectionDay))) %>%
  ungroup() %>%
  filter(NumberElections != 3)
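A slightly more compact dplyr variant (a sketch; it should return the same rows as above) filters on the distinct count directly, without creating a helper column:

library(dplyr)

df %>%
  group_by(Name) %>%
  filter(n_distinct(ElectionDay) < 3) %>%
  ungroup()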