Extraction of year only instead of date in python - jupyter-notebook

Can someone please help with code to extract only the year from the data shown in the attached photo and store it as a new column, using Python? When I try, the results are inconsistent: I get different values, and both the year and the rest of the date are extracted instead of only the year. I think the year is the second part of the value. I have tried different code and none of it works.
I tried the code below:
df_movies['correct_year'] = df_movies['released'].astype(str).str[-20:]
df_movies['years_scorrect'] = df_movies['released'].astype(str).str[:12]
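Both attempts slice by fixed character positions (.str[-20:] keeps the last 20 characters, .str[:12] keeps the first 12), so the result shifts whenever the values differ in length. A minimal sketch of a pattern-based alternative follows. It assumes the released column holds text such as "June 13, 1980 (United States)"; the sample rows and the new column name year are made up for illustration, since the photo is not reproduced here.

```python
import pandas as pd

# Hypothetical sample values; the real 'released' column may look different.
df_movies = pd.DataFrame({
    "released": [
        "June 13, 1980 (United States)",
        "1981 (Japan)",
        "July 2, 1982",
        None,
    ]
})

# Capture the first standalone run of exactly four digits in each value.
year_text = df_movies["released"].astype(str).str.extract(r"\b(\d{4})\b", expand=False)

# Convert to a nullable integer column; rows without a year become <NA>.
df_movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(df_movies)
```

If the column can be parsed as real dates, converting it with pd.to_datetime and reading .dt.year may be the cleaner route; the regular expression is only a fallback for free-form text.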

Related

r: How to manipulate the ggparcoord input column inside the function

I want to compare the week of the year, as returned by week(), for two parallel date columns from two different years. I'm using the ggparcoord function and looking for a way to turn the dates in the two columns into the week count of each date. I would rather not modify the table itself.
my code is:
ggparcoord(data, columns = 38:39)
and I'm looking for something like ggparcoord(data, columns = week(38):week(39)), that actually works.
In addition, if anyone knows how, I would be happy to learn how to use ggparcoord with column names instead of column numbers.
Thanks!

Adding a column to a data frame in R date format

Question: Create a new reldate column in the movies data frame in R by converting the column release_date into R date format.
This is my code:
movies <-read.csv("C:/Users/phili/Downloads/movies500.csv")
movies
movies$reldate <- format(as.Date(movies$release_date),"%d/%m/%Y")
print(movies)
Unfortunately, the second piece of code does not add a new column in R date format.
If you can't answer my question directly, please use a very similar example.
In the future it would be helpful to see your data or similar example data instead of a screenshot.
Anyway, it looks like there are three things that need to be fixed:
You probably don't need the format() function, which returns character strings rather than dates. What you might have wanted is the format= argument of the as.Date() function.
The "%d/%m/%Y" part tells R what format to expect the dates in. Your dates are year, month, day, so the order is wrong.
Similarly, your dates are separated by dashes, not slashes.
So it should look like this: as.Date("2018-09-12",format="%Y-%m-%d")
So in your example try this: as.Date(movies$release_date,format="%Y-%m-%d")
Or, because "%Y-%m-%d" is one of the default formats that as.Date() tries, you could probably just do as.Date(movies$release_date)

Name matching and correcting spelling error in r

I have a huge data table with millions of rows that consists of merchandise codes and their descriptions. I want to assign a category to each group (based on the combination of code and description). The problem is that the description is spelled in different ways, and I want to convert all the similar names into a single one. Here is an illustrative example:
library(data.table)
dt <- data.table(code = c(rep(1,2),rep(2,2),rep(3,2)), name = c('McDonalds','Mc Dnald','Macys','macy','Comcast','Com-cats'))
dt[,cat:='NA']
setkeyv(dt,c('code','name'))
dt[.(1,'McDonalds'),cat:='Restaurant']
dt[.(1,'Mc Dnald'),cat:='Restaurant']
dt[.(2,'Macys'),cat:='Department Store']
Of course, in the real case it is impossible to go through all the spellings that refer to the same word and fix them manually.
Is there a way to detect all the similar words and convert them to a single (correct) spelling?
Thanks in advance.

Plotting POSIXct in ggplot manually scaling x-axis

I am trying to plot this wind speed data with years displayed on the x-axis. The data frame was set up as
wsAvg<-data.frame(date=as.POSIXct(ws07$date[1224:1559]),u.1=(ws07$u[1224:1559]),stringsAsFactors = FALSE)
wsAvg<-rbind(wsAvg,c(date=as.POSIXct(ws08$date[1032:1367]),(ws08$u[1032:1367])))
Below I use ggplot to plot my wind speed data frame.
ggplot(wsAvg,aes(x=date,y=as.numeric(u.1)))+geom_point(size=3,pch=2)+
geom_smooth(method="lm",colour="black",se=FALSE)+
#scale_x_datetime(limits=as.POSIXct(c('2006-09-01','2016-10-01')),breaks=date_breaks("1 year"),labels=date_format("%Y"))+
Without scale_x_datetime() in my command, I get those dates. When I add scale_x_datetime() to manually scale my x-axis so that it displays only years, all my data lines up on 2007. Does anyone know why this is?
It is very difficult to provide the answer to your question, since we don't have a clear picture of any of your data. With that being said, let's look at the information you did provide and see where the likely source of the problem is for your question.
The issue is clearly related to the formatting/data located in your "date" column. It's best to look at this stepwise and test at each step to see what can go wrong here:
Your raw data: There is likely nothing wrong with your base data, but we don't know the format of the "date" vector coming from ws07$date[1224:1559] and ws08$date[1032:1367]. Your raw data originates from two data frames, so just confirm that the raw data from these two vectors is formatted identically, but more importantly, is it already formatted as a date? What is class(ws08$date)? Also, what does the data look like if you took a sample of that dataset? (e.g. ws07$date[sample(1224:1559, 20)]).
Conversion to POSIXct: The first code you show includes as.POSIXct(), but does not include the argument for format=. You may or may not need to specify this, but I would recommend consulting the documentation to be sure you're using the function correctly. You can try converting a small subset of the data just using as.POSIXct(ws07$date[1224:1250]) or something like that. Does it give you the dates formatted correctly? If not, try specifying the format= arg until it "works" as you intended.
Initial plot and second plot: The data is spread out in the first plot, likely kind of how you expected. What about the month/day combinations in the first plot - are they correct? If they are correct, it may indicate the year is being read wrong, since apparently all dates are clustered around May and June of 2007. Comparing the first and second plots, there's no obvious issue with scale_x_datetime() here. Those two plots are consistent with data that has x values = dates ranging from May-June of 2007.
Bottom line: Hard to discern exactly where it's going wrong for you, but likely it's (1) in the conversion to date using as.POSIXct from your ws07 and ws08 datasets, or (2) the format of ws07$date or ws08$date being imported/converted incorrectly. The solution is to use the format= argument in the date conversion/import function you are using to ensure that the format is correct and years/months/dates are imported accordingly.
Here is the code that worked for me. Instead of using the c() function when binding data from the other datasets, I had to use data.frame() to add the other years to the wsAvg data frame.
wsAvg<-data.frame(date=as.POSIXct(ws07$date[1224:1559]),u.1=(ws07$u[1224:1559]),stringsAsFactors = FALSE)
wsAvg<-rbind(wsAvg,data.frame(date=as.POSIXct(ws08$date[1032:1367]),u.1=(ws08$u[1032:1367])))

Specific date format conversion problems in R

Basically I want to know why as.Date(200322,format="%Y%W") gives me NA. While we are at it, I would appreciate any advice on a data structure for repeated cross-section (aka pseudo-panel) in R.
I did get aggregate() to (sort of) work, but it is not flexible enough: for example, it drops data in columns when I omit the missing values.
Specifically, I have a survey that is repeated weekly for a couple of years, with a bunch of similar questions whose answers I would like to combine, average, condition and plot in both dimensions. Getting the date conversion right should presumably help me towards my goal with the zoo package or something similar.
Any input is appreciated.
Update: thanks for the string suggestion, but as you can see in your own example, the %W part doesn't work. It only identifies the year and fills in the current day, whereas I need to set a specific week (and leave the day blank).
Use a string as first argument in as.Date() and select a specific weekday (format %w, value 0-6). There are seven possible dates in each week, therefore strptime needs more information to select a unique date. Otherwise the current day and month are returned.
> as.Date(paste("200947", "0", sep="-"), format="%Y%W-%w")
[1] "2009-11-22"
