I want to add a new column to a data frame that contains the (i-1)th value. I can do this in a for loop, but I would like to know if there is a more straightforward way to do it. I would also like to do it for other lags.
Example:
Price PrevPrice
23 NA
24 23
25 24
35 25
You can either do
library(dplyr)
mutate(df, PrevPrice=lag(Price))
Or
df$PrevPrice <- c(NA, df$Price[-nrow(df)])
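For the other lags the question mentions, dplyr's lag() takes an n argument, and the base R version generalizes the same way; for a lag of 2, for example:
mutate(df, PrevPrice2 = lag(Price, n = 2))
# base R equivalent
df$PrevPrice2 <- c(rep(NA, 2), df$Price[seq_len(nrow(df) - 2)])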
If you have multiple columns to lag, another option is data.table, where you can use shift() (see ?shift). By default, its type is "lag". For multiple columns, specify the column indices (for example, the first two columns here) in .SDcols.
library(data.table) #data.table_1.9.5
setDT(df)[, paste0(names(df)[1:2], 'lag') := shift(.SD), .SDcols=1:2]
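shift() also accepts a vector of lags in n, so several lags can be created in one call. A sketch for lags 1 and 2 of the first two columns; shift() returns each column's lags grouped together, hence the rep(..., each = 2) when building the names:
setDT(df)[, paste0(rep(names(df)[1:2], each = 2), '_lag', 1:2) := shift(.SD, n = 1:2), .SDcols = 1:2]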
In trying to use unite to concatenate two numeric columns, I find it turns the resulting column into a character string - how do I avoid this?
df %>%
  unite(housinggrossconcat, housinggrossedupnet, housing.gross, na.rm = TRUE, remove = FALSE) %>%
  unite(grossconcat, grossedupnet, gross, na.rm = TRUE, remove = FALSE)
Data looks like this:
Gross Grossedupnet Grossconcat (desired new column)
30 NA 30
NA 45 45
NA 45 45
350 NA 350
It is such that wherever Gross has a value, Grossedupnet will be NA, and vice versa. They're both numerical values. I want to concatenate the two into the new column, but it is turning the new column into a character variable.
You're looking for coalesce:
df %>%
  mutate(Grossconcat = coalesce(Gross, Grossedupnet))
dplyr::coalesce() is built for exactly this purpose: taking the first non-missing value. It works on as many columns as you give it, and it preserves the input type. unite() is the complement of separate(); it is for pasting strings together, hence the character output.
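coalesce() also takes more than two inputs, and a trailing scalar is recycled as a default; a sketch with a hypothetical fallback of 0 for rows where both columns are NA:
df %>%
  mutate(Grossconcat = coalesce(Gross, Grossedupnet, 0))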
We can also use fcoalesce
library(data.table)
setDT(df)[, Grossconcat := fcoalesce(Gross, Grossedupnet)]
Is there a nice way to make a sub-group within a grouping column in data.table operations?
The result I would like is the output from this:
dt <- data.table(
  group = c("a","a","a","b","b","b","c","c"),
  value = c(1,2,3,4,5,6,7,8)
)
dt[group!="a", group:="Other"][, sum(value), by=.(group)][]
which gives
group V1
a 6
Other 30
However, this alters the original data.table. I don't know if there is a different way to do this that wouldn't involve merging two data.tables. I can imagine a more complicated use case where I want group %in% c("a","b") as one sub-group, group %in% c("c","d") as another, and so on.
I think this is like a SQL right excluding join (borrowing SQL join terminology).
You can go through group by group and, within each one, perform an anti-join:
# 'group' is no longer found in .SD, hence make a copy of the column
dt[, g := group]
# go through each group, anti-join with the other groups, aggregate value
dt[, .(
  sumGrpVal = sum(value),
  sumNonGrpVal = dt[!.SD, sum(value), on = c("group" = "g")]
), by = .(group)]
or an even faster way:
dt[, .(
  sumGrpVal = sum(value),
  sumNonGrpVal = dt[group != .BY$group, sum(value)]
), by = .(group)]
output:
group sumGrpVal sumNonGrpVal
1: a 6 30
2: b 15 21
3: c 15 21
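If the goal is to leave dt untouched entirely, the coarser grouping can also be computed inside by. A sketch (starting from the original dt) using data.table's fifelse() and fcase(), which also covers the multi-sub-group case from the question:
# lump everything that is not "a" together, without modifying dt
dt[, sum(value), by = .(group = fifelse(group == "a", "a", "Other"))]
# several sub-groups, e.g. "a"/"b" vs "c"/"d" vs the rest
dt[, sum(value), by = .(group = fcase(group %in% c("a","b"), "ab",
                                      group %in% c("c","d"), "cd",
                                      default = "Other"))]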
I am currently working on life tables, and I have a data set with 19 columns. Columns 5 through 19 contain the dates of each birth an individual had. I want to create a new variable (column 20) which contains the latest birth (the last birth) for each row across the 5th to 19th columns. The data entries are of factor class.
Here is what my data looks like:
ID_I Sex BirthDate DeathDate Parturition1 Parturition2
501093007 Female 1813-01-14 1859-09-29 1847-11-16 1850-05-17
400707003 Female 1813-01-15 1888-04-14 1844-10-07 1845-10-17
100344004 Female 1813-02-06 1897-05-07 1835-03-09 1837-01-03
I have tried the code suggested in one of the answers:
df[, "max"] <- apply(df[, 5:19], 1, max)
But I get the overall max across all the rows for the variable df$max. Could it be because my date entries aren't numeric or character?
You're almost there; this should work:
df$max.date <- apply(df[, 5:19], 1, max)
(apply() coerces the data frame to a character matrix, even for factors, and because these dates are in YYYY-MM-DD format, alphabetical order matches chronological order.)
Based on the example data, we can also use pmax after converting to 'Date' class
df1$max.date <- do.call(pmax,lapply(df1[3:ncol(df1)], as.Date))
df1$max.date
#[1] "1859-09-29" "1888-04-14" "1897-05-07"
NOTE: change the 3 in 3:ncol(df1) to 5 for the original dataset.
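If some of the parturition columns contain NA (likely with 15 of them), pmax()'s na.rm argument can be passed through do.call as well; a sketch:
df1$max.date <- do.call(pmax, c(lapply(df1[3:ncol(df1)], as.Date), na.rm = TRUE))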
I am relatively new to R, so this may be a simple question. I tried searching extensively for an answer but couldn't find one.
I have a data frame in the form:
firstword nextword freq
a little 23
a great 46
a few 32
a good 15
about the 57
about how 34
about a 48
about it 27
by the 36
by his 52
by an 12
by my 16
This is just a tiny sample for illustration from my data set. My dataframe is over a million rows. firstword and nextword are character type. Each firstword can have many nextwords associated with it, while some may have only one.
How do I generate another data frame from this that is sorted in descending order of freq within each firstword and contains at most the top 6 nextwords?
I tried the following code.
small = ddply(df, "firstword", summarise, nextword=nextword[order(freq,decreasing=T)[1:6]])
This works for smaller subset of my data, but runs out of memory when I run it on my entire data.
Here's an efficient approach using the data.table package.
First, you don't need to sort freq within every group; sorting once is enough and more efficient. So one way would be simply
library(data.table)
setDT(df)[order(-freq), head(.SD, 6), by = firstword] # head() returns at most 6 rows per group
another way (possibly more efficient) is to find the row indexes using .I and then subset:
indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
df[indx]
The dplyr package is built for this kind of grouped manipulation and handles large datasets well. Try this:
library(dplyr)
df %>% group_by(firstword) %>% arrange(desc(freq)) %>% top_n(6, freq)
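In current dplyr, slice_max() supersedes top_n() and makes the cutoff explicit (top_n() keeps ties, so a group can return more than 6 rows); a sketch:
df %>%
  group_by(firstword) %>%
  slice_max(freq, n = 6, with_ties = FALSE) %>%
  ungroup()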
I have a dataset with numeric and categorical variables and ~200,000 rows, but many variables are constant (both numeric and categorical). I am trying to create a new dataset where any variable with length(unique(x)) <= 1 is dropped.
Example data set and attempts so far:
Temp=c(26:30)
Feels=c("cold","cold","cold","hot","hot")
Time=c("night","night","night","night","night")
Year=c(2015,2015,2015,2015,2015)
DF=data.frame(Temp,Feels,Time,Year)
I would think a loop would work, but something is wrong in my two attempts below. I've tried:
for (i in unique(colnames(DF))){
Reduced_DF <- DF[,(length(unique(DF$i)))>1]
}
But I really need a vector of the colnames where length(unique(DF$columns))>1, so I tried the below instead, to no avail.
for (i in unique(DF)){
if (length(unique(DF$i)) >1)
{keepvars <- c(DF$i)}
Reduced_DF <- DF[keepvars]
}
Does anyone out there have experience with this type of subsetting/dropping of columns with fewer than a certain number of levels?
(As an aside, your loop fails because DF$i looks for a column literally named "i"; with a loop variable you need DF[[i]].) You can find out how many unique values are in each column with:
sapply(DF, function(col) length(unique(col)))
# Temp Feels Time Year
# 5 2 1 1
You can use this to subset the columns:
DF[, sapply(DF, function(col) length(unique(col))) > 1]
# Temp Feels
# 1 26 cold
# 2 27 cold
# 3 28 cold
# 4 29 hot
# 5 30 hot
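The same test also works with base R's Filter(), which keeps only the columns passing the predicate; a one-line sketch:
Reduced_DF <- Filter(function(col) length(unique(col)) > 1, DF)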
Another way with data.table
#Convert object to data.table object
library(data.table)
setDT(DF)
#Drop columns
todrop <- names(DF)[which(sapply(DF,uniqueN)<2)]
DF[, (todrop) := NULL]
One advantage of this method is that it does not make a copy of the data, which matters with a dataset as large as yours.
If you are using data.table 1.9.4, you would change to the following:
#Drop columns
todrop <- names(DF)[which(sapply(DF, function(x) length(unique(x)) < 2))]
DF[, (todrop) := NULL]
Another possible solution, though in Python with pandas rather than R, drops the categorical columns in two lines of code: the first line collects the categorical columns, the second drops them. A sketch, assuming the categorical columns have pandas object dtype (note this drops all categorical columns, not just the constant ones):
cat_cols = df.select_dtypes(include='object').columns
df = df.drop(cat_cols, axis=1)