The data that I am trying to work with is in a data.frame that has the following format:
Title Year
Something 2006
something2 2007
Something 2008
Something 2009
I'm specifically interested in subsetting the data so that only the rows with a Year later than 2008 remain. For example, this would give:
Title Year
Something 2009
Is it acceptable to use something like this:
df[!(df$Year <= 2008), ]
If by fewer you mean older (a lower number), then you are looking for df <- df[df$Year > 2008, ] or, as you do, df <- df[!(df$Year <= 2008), ]. Note the comma before the closing bracket: it selects rows and keeps all columns.
You will need to overwrite the original data.frame (or assign the result to a new one), or it will just display the subset, but not save it. You can also use subset(df, Year > 2008) or the dplyr package. Whatever suits you best.
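As a minimal sketch of the options mentioned above, the snippet below rebuilds the example table from the question and applies each base R variant. The object names `kept_a`, `kept_b` and `kept_c` are only illustrative.

```r
# Example data copied from the question
df <- data.frame(
  Title = c("Something", "something2", "Something", "Something"),
  Year  = c(2006, 2007, 2008, 2009)
)

# Three ways to keep only the rows with Year later than 2008
kept_a <- df[df$Year > 2008, ]        # direct comparison
kept_b <- df[!(df$Year <= 2008), ]    # negated condition, as in the question
kept_c <- subset(df, Year > 2008)     # subset() evaluates Year inside df

print(kept_a)
```

One difference worth knowing: if Year contains NA values, the bracket forms return rows of NA for them, while subset() drops those rows.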
Related
I am currently working with data in R. My dataset looks like this (except with about 3 million observations):
This is the current structure of the data:
And I have two objectives...
Objective 1 is to structure it so that it looks like this:
then my second objective is to go back to the original structure and make it look like this:
I have tried variations of aggregate with dcast (which is apparently being deprecated...?)
so, for example, I have tried this:
df2 <- dcast(df1, Store + Sales~ Year, value.var = "Sales")
or even
df %>%
group_by(Store, Year) %>%
summarise(across(starts_with('Sales'), sum))
And I get a diagonal of sales totals across years, but then I'm unable to summarize them so that it looks like
Store Year1 Year2
A $$ $$
B $$ $$
Since there is so much data, it looks like a bunch of stacked identity matrices ...except, instead of 1's there are sales values for the years (there are many many years, not just two).
I am looking for suggestions on how to proceed. One package I found was 'pivottabler' and I have not used it yet, I wanted to see if anyone had any better suggestions first.
:::Much appreciated:::
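One possible direction, sketched in base R under the assumption that the data has the columns Store, Year and Sales used in the code above: xtabs() sums Sales for every Store/Year combination and lays the result out with one row per store and one column per year. The sample values are made up for illustration. The diagonal pattern probably comes from having Sales on the left-hand side of the dcast formula, which makes every distinct Store/Sales pair its own row.

```r
# Made-up sample in the assumed long format: one row per sale
df1 <- data.frame(
  Store = c("A", "A", "A", "B", "B"),
  Year  = c(2019, 2019, 2020, 2019, 2020),
  Sales = c(10, 5, 20, 7, 3)
)

# Sum Sales within each Store/Year cell; rows are stores, columns are years
wide <- xtabs(Sales ~ Store + Year, data = df1)

# Convert the 2-d table into an ordinary data frame if needed
wide_df <- as.data.frame.matrix(wide)
print(wide_df)
```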
I have a large dataframe sorted by fiscal year and fiscal period. I am trying to create a time plot starting at fiscal period 1 of 2015, ending at fiscal period 13 of 2019. I have two columns, one for FY, one for FP. They look like this.
I merged the two columns together separated by a 0 in a new column (C) using the code:
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
This ensures that my new column is a numeric variable.
It looks like this (check column C)
Then since I want to plot a time plot of total sales per period, I aggregated all sales to the level of C, so all rows ending with the same C aggregate together. I used this code for the aggregation.
MarkP11 <- MarkP %>%
group_by(C) %>%
summarise(Sales=sum(Sales))
This is what MarkP11 looks like.
The problem I'm having is that the rows are out of order, so when I plot them, it gives me an incorrect plot. It has period 10 coming after period 1.
I've done some research and discovered that the sprintf function may work, but I'm not sure how I can incorporate that into the code for my data frame.
The code below is how my C column is created by merging two columns. I believe I need to edit this line with the 'sprintf' function, but I'm not sure how to get that to work.
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
I expect the ordering of the MarkP dataframe to look something like this:
sprintf is indeed what you want:
sprintf("%0.0f%02.0f", 2019, c(1,10))
# [1] "201901" "201910"
This assumes that FP's range is 0-99. It would not be incorrect to use sprintf("%d%02d", 2019, c(1,10)) since you're intending to use integers, but sometimes I find that seemingly-integer values can trigger Error ... invalid format '%02d', so I just strong-arm it. You could also use as.integer on each set of values ... another workaround.
I was speaking with a colleague of mine and he helped me figure out the solution. Like r2evans commented, sprintf is the correct function. The syntax that worked for me was:
MarkP$C = paste(MarkP$FY, sprintf("%02d", MarkP$FP), sep="")
What that did in my code was concatenate the FY and FP values into a new column named "C":
-It first put the FY value into the new column.
-Since sep="", there was no separator character, so FY and FP were simply joined together.
-Since I wrapped FP in sprintf("%02d", ...), FP was padded with a leading zero to two digits before being appended to FY.
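To see why the padding matters for ordering, here is a small sketch with made-up FY/FP values. The column names `C_unpadded` and `C_padded` are illustrative, and as.integer() is an added precaution so that "%02d" always receives integers.

```r
# Made-up fiscal years and periods, deliberately out of order
MarkP <- data.frame(
  FY = c(2015, 2015, 2015, 2016),
  FP = c(1, 10, 2, 1)
)

# Original approach: a literal "0" separator, so period 10 becomes "2015010"
MarkP$C_unpadded <- paste(MarkP$FY, MarkP$FP, sep = "0")

# Padded approach: FP is always two digits, so every key has six characters
MarkP$C_padded <- paste(MarkP$FY, sprintf("%02d", as.integer(MarkP$FP)), sep = "")

# Fixed-width keys sort in chronological order
print(sort(MarkP$C_padded))
```

Because every padded key has the same width, sorting it as text gives the same order as sorting by year and then by period.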
I have a dataframe as follows:
TAS1 2000 obs. of 9862 variables
Each of these variables (columns) represent daily temperatures from 1979-01-01 to 2005-12-31. The colnames have been set with these dates. I now wish to separate the dataframe into twelve separate monthly data frames - containing Jan, Feb, Mar etc.
I have tried:
TAS1.JAN = subset(TAS1, grepl("-01-"), colnames(TAS1))
But get the error:
Error in grepl("-01-") : argument "x" is missing, with no default
Is there a relatively quick solution for this? I feel there must be but haven't cracked it despite trying various solutions.
I would subset January data like below.
Jan_df <- subset(MyDatSet, select = grepl("-01-", colnames(MyDatSet)))
I have assumed that your parent dataset is called MyDatSet and a pattern "-01-" defines that it is January data.
You can repeat the process for the other 11 months or write a loop over the month numbers.
Like Roland, in the comments, suggested, I would opt for melting mechanism too. However, since I do not know your use case, here you go based on what you posted and asked for.
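A sketch of the loop idea, assuming the column names are dates in YYYY-MM-DD form as described in the question. The data here is a small random stand-in covering one year, and `month_of_col` and `monthly` are illustrative names.

```r
# Stand-in for TAS1: 3 rows, one column per day of 1979, named by date
set.seed(1)
dates <- seq(as.Date("1979-01-01"), as.Date("1979-12-31"), by = "day")
TAS1 <- as.data.frame(matrix(rnorm(3 * length(dates)), nrow = 3))
colnames(TAS1) <- format(dates)

# Characters 6-7 of "YYYY-MM-DD" are the month, e.g. "01" for January
month_of_col <- substr(colnames(TAS1), 6, 7)

# Group the column positions by month, then extract each group of columns
monthly <- lapply(
  split(seq_along(TAS1), month_of_col),
  function(idx) TAS1[, idx, drop = FALSE]
)

# monthly[["01"]] holds the January columns, monthly[["02"]] February, ...
print(names(monthly))
```

Keeping the twelve data frames in one named list avoids creating twelve separate variables by hand.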
As your error says, you are missing an argument there:
tas1.jan <- subset(df, grepl("-01-", df$tas1))
Another way to do it with the help of stringr and dplyr would be:
library(stringr)
library(dplyr)
tas1.jan <- df %>% filter(str_detect(tas1, "-01-"))
The downside of this approach: you need to run a loop or do this 12 times to cover all months.
I have a column "Year" in my dataframe ("import") and I need to only select 2015 out of some 30 years. However none of the steps I tried worked. Things I tried include:
iy2015<-subset(import, import$year==2015)
iy2015<-import[which(import$year==2015),]
iy2015<-import[import$year==2015,]
all have given me an empty dataframe.
For me your last option works, check if 2015 is in the column and check whether year is a column name. I used: iy2015 = import[import$Year==2015,]
EDIT:
You need to use Year instead of year. R is case-sensitive, so import$year is NULL and the comparison selects no rows.
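A short sketch of what goes wrong, using a made-up `import` data frame with a `Year` column (the `value` column and the name `empty` are illustrative only):

```r
# Made-up stand-in for the real data
import <- data.frame(Year = c(2014, 2015, 2015, 2016), value = 1:4)

# The lower-case name does not exist, so $ returns NULL
print(is.null(import$year))

# NULL == 2015 is an empty logical vector, which selects zero rows
empty <- import[import$year == 2015, ]
print(nrow(empty))

# With the correct capitalisation the filter behaves as expected
iy2015 <- import[import$Year == 2015, ]
print(iy2015)
```

Running names(import) is a quick way to check the exact spelling of the column names.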
I am using a list of variables to download and create dataframes in R. I'd like to be able to use this list to make changes to different columns in each dataframe, but I am having trouble calling particular columns using the list of variables.
countries= c("USA","CHN")
for (i in 1:length(countries)){
download.file(url[i],savedata[i])
assign(countries[i],xmlToDataFrame(savedata[i]))
}
Now I have dataframes that look like this:
head(USA)
indicator country date value decimal
1 GDP (current US$) United States 2012 15684800000000 0
2 GDP (current US$) United States 2011 14991300000000 0
3 GDP (current US$) United States 2010 14419400000000 0
4 GDP (current US$) United States 2009 13898300000000 0
5 GDP (current US$) United States 2008 14219300000000 0
6 GDP (current US$) United States 2007 13961800000000 0
And I would like to go through and make several changes, such as formatting the date column with the as.Date() function, or changing the units of the value column, but I want to be able to do the same to both dataframes (or to an arbitrary number, in case I increase the length of countries).
However, whenever I try to do this I can't seem to use the list of countries in the countries variable to get 'inside' each data frame. My initial guess was putting something like this in a loop:
assign(paste(countries[i],"date",sep="$"),
as.date(get(paste(countries[i],"date",sep="$")))
In particular, I get confused about how the get(paste(countries[i])) works if I am not trying to get the particular column date, and how the paste(countries[i],"date",sep="$") prints the correct name, but I can't seem to get just the one column I'd like to manipulate.
Additionally, I realize loops are not the ideal way of doing this, but I've been having the same problem with the apply functions, though I am likely having trouble with them due to my lack of experience. Suggestions for how to do it either in a loop or without one would be much appreciated. Super R novice here, just trying to learn. Also, if you've come across a clear explanation/answer for this somewhere else, I'd appreciate you pointing me towards it.
It's much easier if you use lists. Start with an empty one:
mylist = list()
Then change this:
assign(countries[i],xmlToDataFrame(savedata[i]))
to this:
mylist[[i]] <- xmlToDataFrame(savedata[i])
Then make a function that does your formatting, for instance:
f <- function(df){
  # as.Date (capital D) is the base R function; pass a format if your dates need one
  within(df, date <- as.Date(date))
}
And use lapply to apply it to all dataframes:
mylist2 <- lapply(mylist, f)
If you want to access dataframes by name, use this:
names(mylist2) <- countries
And test:
mylist2[["USA"]]
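Putting the pieces together, here is a self-contained sketch of the list approach. The helper `fake_download()` is a hypothetical stand-in for the download.file/xmlToDataFrame step, the columns are assumed to arrive as character, and the CHN values are made-up placeholders; only the USA values come from the question. It also shows why the assign/paste attempt fails: assign("USA$date", ...) creates a separate variable literally named USA$date rather than modifying the date column of USA.

```r
countries <- c("USA", "CHN")

# Hypothetical stand-in for xmlToDataFrame(savedata[i]);
# assumes every column arrives as character.
fake_download <- function(country) {
  if (country == "USA") {
    # Values copied from the question
    data.frame(date  = c("2012", "2011"),
               value = c("15684800000000", "14991300000000"),
               stringsAsFactors = FALSE)
  } else {
    # Made-up placeholder values
    data.frame(date  = c("2012", "2011"),
               value = c("2000000000000", "1000000000000"),
               stringsAsFactors = FALSE)
  }
}

# One data frame per country, stored in a named list
mylist <- lapply(countries, fake_download)
names(mylist) <- countries

# One cleaning function, applied to every data frame
clean <- function(df) {
  df$date  <- as.integer(df$date)            # the date column holds years here
  df$value <- as.numeric(df$value) / 1e9     # rescale to billions
  df
}
mylist2 <- lapply(mylist, clean)

print(mylist2[["USA"]])
```

Adding a country then only requires extending `countries`; the cleaning step stays unchanged.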