I'm new to R programming, so this question might be simple.
Anyway, I've tried to find an answer to this specific thing I'm trying to do and didn't get it.
So, I'm trying to import new data into my old data.frame.
The problem is that this data has to substitute previous NA values in variables that already exist.
Also, my data covers different individuals (companies) in different periods (years), and my new data set only has the companies and years that were missing, plus some observations that I already had.
I tried to simulate the problem with the data frames below:
Data frame with NAs:
df1 <- data.frame(company = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                  year = rep(2016:2018, times = 3),
                  income = c(95, 87, 93, NA, NA, 58, 102, 80, NA),
                  debt = c(43, 50, 51, NA, 37, 37, 53, NA, NA),
                  stringsAsFactors = FALSE)
To search for the new data, I created a data set with only the rows that had missing values, as my data had too many observations:
df_NA <- df1[is.na(df1$income) | is.na(df1$debt), ]
So after searching, I was able to find the missing data, and now I have something like this:
df2 <- data.frame(company = c("A", "B", "C", "C"),
                  year = c(2018, 2016, 2017, 2018),
                  income = c(60, 55, 80, 82),
                  debt = c(32, 37, 53, 48),
                  stringsAsFactors = FALSE)
Now I'm trying to put this data together so that I have the complete data.frame to work with.
The problem is that I couldn't find a way to do it yet. I've tried merge and join, indexing by company and year, but the variables that have the same name in both data.frames get duplicated with a suffix.
In my real data I have many more observations and variables to fill, so I want a way to do it with a command. Also, this is going to happen again in the future, so it would be very helpful.
I'm sorry if this was already answered. Thank you!
Here is an option using data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = c("company", "year"), c("income", "debt") := list(i.income, i.debt)]
# company year income debt
#1: A 2016 95 43
#2: A 2017 87 50
#3: A 2018 60 32
#4: B 2016 55 37
#5: B 2017 NA 37
#6: B 2018 58 37
#7: C 2016 102 53
#8: C 2017 80 53
#9: C 2018 82 48
Or another option using dplyr:
library(dplyr)
full_join(df1, df2, by = c("year", "company")) %>%
  mutate(
    income = coalesce(income.x, income.y),
    debt = coalesce(debt.x, debt.y)
  ) %>%
  select(company, year, income, debt)
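If your dplyr is version 1.0.0 or newer, rows_patch() was built for exactly this fill-the-NAs pattern. Here is a minimal sketch; note it only fills cells that are NA in df1, leaving existing values untouched:
library(dplyr) # rows_patch() requires dplyr >= 1.0.0

# patch NA values in df1 with the matching values from df2,
# keyed on company and year; non-NA cells in df1 are kept as-is
df_filled <- rows_patch(df1, df2, by = c("company", "year"))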
I have a data frame df1 of import data for 397 different industries over 17 years and several different exporting countries/regions.
> head(df1)
year importer exporter imports sic87dd
2300 1991 USA CAN 9.404848e+05 2011
2301 1991 USA CAN 2.259720e+04 2015
2302 1991 USA CAN 5.459608e+02 2021
2303 1991 USA CAN 1.173237e+04 2022
2304 1991 USA CAN 2.483033e+04 2023
2305 1991 USA CAN 5.353975e+00 2024
However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US; sic87dd is a code that uniquely identifies the 397 industries.)
So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.
sic87dd <- unique(df1$sic87dd)
year <- unique(df1$year)
df2 <- data.frame(sic87dd = rep(sic87dd, each = 17), year = rep(year, 397), imports = rep(0, 6749))
i <- 1
j <- 1
while (i <= nrow(df2)) {
  while (j <= nrow(df1)) {
    if (df1$sic87dd[j] == df2$sic87dd[i] & df1$year[j] == df2$year[i]) {
      df2$imports[i] <- df2$imports[i] + df1$imports[j]
    }
    j <- j + 1
  }
  i <- i + 1
  j <- 1
}
Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.
Any help is appreciated.
There is a simple solution using dplyr:
First, if you want the industry code treated as a label rather than a number, you can convert it to a factor (assuming the entire field consists of a 4-digit code):
df1$sic87dd <- as.factor(df1$sic87dd)
Next, group by both industry and year, and summarise:
library(dplyr)

df1 %>%
  group_by(sic87dd, year) %>%
  summarise(total_imports = sum(imports))
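Since you mentioned data.table: the equivalent aggregation there is a one-liner. A sketch, assuming df1 looks like your head() output above:
library(data.table)

setDT(df1) # converts the data.frame to a data.table in place
# sum imports within each industry/year combination
df2 <- df1[, .(imports = sum(imports)), by = .(sic87dd, year)]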
I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimal reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID within each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in 1:nrow(df2)) {
  article_number <- df2$art_num[i]
  if (article_number %in% df1$cumsum) { # note: cumsum should be an interval before doing this?
    # get article number, year, publication in new df
    # also calculate article ID in each year/publication
  }
}
Thanks in advance for any help! I'm still lost with writing loops in R...
EDIT: simplified example, as per Frank's suggestion:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I calculated/built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This is different from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number[1] = df3$art_id[1] - df1$cumsum[2].
I need to do this calculation for every article in my data base so I don't do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id up to the next cumsum (roll=-Inf) and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.year,
x.outlet,
i.art_id - x.art_cstart + 1L
)]]
   art_id outlet year art_num
1:     66      1 2002      19
2:    120      2 2000      35
3:     77      1 2002      30
4:    156      2 2001      35
5:     24      1 2001      20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join, j evaluates to a list of columns, with .(...) as shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
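If you prefer base R, the same lookup can be sketched with findInterval(), assuming df1 is ordered by cumsum as above:
# findInterval(x, v) returns, for each x, how many values of v are <= x;
# using art_id - 1 makes an id that sits exactly on a cumsum boundary
# fall into the earlier (correct) interval
idx <- findInterval(df2$art_id - 1, df1$cumsum) + 1
df2$outlet <- df1$outlet[idx]
df2$year <- df1$year[idx]
df2$art_num <- df2$art_id - c(0, df1$cumsum)[idx]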
I think this is the code you need:
# works on the edited example above, where the IDs are in df2$art_id
df3 <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df3) <- c("art_id", "year", "outlet", "art_num")

for (i in 1:nrow(df2)) {
  for (j in 1:nrow(df1)) {
    # lower bound of row j's interval: the previous cumsum (0 for the first row)
    lower <- if (j == 1) 0 else df1$cumsum[j - 1]
    if (df2$art_id[i] > lower && df2$art_id[i] <= df1$cumsum[j]) {
      # record the article ID, year, and outlet in the new df
      df3[i, 1] <- df2$art_id[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # position of the article within its year/outlet
      df3[i, 4] <- df2$art_id[i] - lower
    }
  }
}
I'm using R to create an occupancy model encounter history. I need to take a list of bird counts for individual leks, separate them by year, then code the count dates into two intervals: within 10 days of the first count (Interval 1), or 10 or more days after the first count (Interval 2). For any year where only one count occurred, I need to add an entry coded as "U" to indicate that no count occurred during the second interval. Following that, I need to subset out only the max count in each year and interval. A sample dataset:
ComplexId Date Males Year category
57 1941-04-15 97 1941 A
57 1942-04-15 67 1942 A
57 1943-04-15 44 1943 A
57 1944-04-15 32 1944 A
57 1946-04-15 21 1946 A
57 1947-04-15 45 1947 A
57 1948-04-15 67 1948 A
57 1989-03-21 25 1989 A
57 1989-03-30 41 1989 A
57 1989-04-13 2 1989 A
57 1991-03-06 35 1991 A
57 1991-04-04 43 1991 A
57 1991-04-11 37 1991 A
57 1991-04-22 25 1991 A
57 1993-03-23 6 1993 A
57 1994-03-06 17 1994 A
57 1994-03-11 10 1994 A
57 1994-04-06 36 1994 A
57 1994-04-15 29 1994 A
57 1994-04-21 27 1994 A
Now here is the code I wrote to accomplish my task, naming the dataframe above "c1" (you'll need to coerce the Date column to class Date, and the category column to character):
c1_Year <- lapply(unique(c1$Year), function(x) c1[c1$Year == x, ]) # splits complex counts into a list by year

for (i in 1:length(c1_Year)) {
  c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][, 2] - c1_Year[[i]][1, 2]))
} # adds column with the difference between the first survey and subsequent surveys

for (i in 1:length(c1_Year)) {
  c1_Year[[i]] <- if (length(c1_Year[[i]][, 1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1, 1], NA, 0, c1_Year[[i]][1, 4], "U", 11))
} # adds U values to years with only 1 count, while coercing the "U" into the appropriate interval

for (i in 1:length(c1_Year)) {
  c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][, 6] < 10, 1, 2)
} # adds interval code for each survey: 1 = fewer than ten days after the first count, 2 = ten or more days after it

for (i in 1:length(c1_Year)) {
  c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
} # subsets out the max count in each interval
The problem arises during the second for-loop, which, when options(error=recover) is enabled, returns:
Error in c1_Year[[i]] : subscript out of bounds
No suitable frames for recover()
At that point the code has actually accomplished what it was supposed to: even though the error message is generated, the extra rows with the "U" code are still appended to the data frames. The issue is that I have 750 leks to do this for, so I tried to build the code above into a function; however, when I run the function on any data, the subscript-out-of-bounds error stops the function from running. I could brute-force it and just run the code above for each lek manually, but I was hoping there might be a more elegant solution. What I need to know is why I am getting the subscript-out-of-bounds error, and how I can fix it.
Here's the function I wrote, so that you can see that it doesn't work:
create.OEH <- function(dataset, final_dataframe) {
  c1_Year <- lapply(unique(dataset$Year), function(x) dataset[dataset$Year == x, ]) # splits complex counts into a list by year
  for (i in 1:length(c1_Year)) {
    c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][, 2] - c1_Year[[i]][1, 2]))
  } # adds column with the difference between the first survey and subsequent surveys
  for (i in 1:length(c1_Year)) {
    c1_Year[[i]] <- if (length(c1_Year[[i]][, 1]) == 1)
      rbind(c1_Year[[i]], c(c1_Year[[i]][1, 1], NA, 0, c1_Year[[i]][1, 4], "U", 11))
  } # adds U values to years with only 1 count
  for (i in 1:length(c1_Year)) {
    c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][, 6] < 10, 1, 2)
  } # adds interval code for each survey: 1 = fewer than ten days after the first count, 2 = ten or more days after it
  for (i in 1:length(c1_Year)) {
    c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
  } # subsets out the max count for each interval
  df <- rbind.fill(c1_Year) # collapse list into a single dataframe
  final_dataframe <- df[!duplicated(df[, c("Year", "Interval")]), ] # remove ties for the max count
}
In this bit of code
for (i in 1:length(c1_Year)) {
  c1_Year[[i]] <- if (length(c1_Year[[i]][, 1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1, 1], NA, 0, c1_Year[[i]][1, 4], "U", 11))
}
you are assigning NULL whenever length(c1_Year[[i]][, 1]) == 1 is not true, which removes those elements from c1_Year entirely.
You probably want
for(i in 1:length(c1_Year)){
if (length(c1_Year[[i]][,1]) == 1) {
c1_Year[[i]] <- rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
}
}
However, I see you are already using ddply, so you may be able to avoid a lot of the repetition.
ddply(c1, .(Year), ...) splits c1 up into one chunk per unique year.
library(plyr)

c2 <- ddply(c1,
            .(Year),
            function(x) {
              # create 'Interval'
              x$Interval <- ifelse(x$Date - x$Date[1] < 10, 1, 2)
              # extract max Males per interval
              o <- ddply(x, .(Interval), subset, Males == max(Males))
              # add the 'U' row if there is no '2' interval
              if (all(o$Interval != 2)) {
                o <- rbind(o,
                           list(o$ComplexId[1], NA, 0, o$Year[1], "U", 2))
              }
              # return the resulting dataframe
              o
            })
I converted your rbind(.., c(...)) to rbind(.., list(...)) to avoid converting everything to character (which is what c() does here, because it cannot hold multiple different types).
Otherwise the code is almost the same as yours.
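For what it's worth, the same steps translate fairly directly to dplyr/tidyr if you ever move away from plyr. A rough sketch, assuming c1$Date is of class Date; note the padded "U" rows will have NA in every column not listed in fill:
library(dplyr)
library(tidyr)

c2 <- c1 %>%
  group_by(Year) %>%
  # Interval 1 = within 10 days of the first count, 2 = later
  mutate(Interval = ifelse(Date - min(Date) < 10, 1, 2)) %>%
  group_by(Year, Interval) %>%
  filter(Males == max(Males)) %>% # keep the max count per interval
  ungroup() %>%
  # pad years that lack a second count with a "U" row
  complete(Year, Interval = 1:2, fill = list(category = "U"))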
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a customer X, I have three possible scenarios:
1 - Customer X is listed in the January database, but he left and is not listed in February
2 - Customer X is listed in both the January and February databases
3 - Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers and their respective information from both dataframes. However, for a customer that is listed in both dataframes, I want to keep his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which of all, all.x or all.y I choose, I get the same undesired output, called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think what would work here is to first merge both databases keeping only the February customers whose IDs are not in January (an anti join), and then merge that result with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from January (i.e. y.x, z.x),
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3$y.x), 2:3] <- d3[is.na(d3$y.x), 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though, so perhaps you should consider @flodel's comments. Also note that there are pitfalls when your original January data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned those in your question.
Try:
data <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
I haven't tested it since there's no sample data, but if you just join on the ID column from February, it should filter out anything that isn't in both frames.
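If you happen to use dplyr, the same filter reads a bit more directly as a semi join (also untested here, for the same reason):
library(dplyr)

# keep January's columns, but only for customers that also appear in February
data <- semi_join(data.jan, data.feb, by = "ID")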
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan, id)
setkey(feb, id)
join <- merge(jan, feb, by = "id", all = TRUE) # merging keyed data.tables returns a data.table
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- merge(x, y, by = "id", all = TRUE)
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL]                # get rid of the extra columns
  setnames(join, 2:4, c("age", "city", "gender")) # rename the columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
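To extend this to the full January-to-November set without naming every month, you could read the files into a list first and fold over it. A sketch with a hypothetical folder and file pattern; adjust the path and pattern to your actual files, and name them so that alphabetical order matches month order, since list.files() sorts alphabetically:
library(data.table)

files <- list.files("customer_data", pattern = "\\.csv$", full.names = TRUE) # hypothetical location
months <- lapply(files, fread) # fread() returns one data.table per file
combined <- Reduce(f, months)  # fold them together with f() defined above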