A question, if I may. I am using Jupyter Notebook and Python 3, and I am working with three CSV files from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases to produce graphs tracking the COVID-19 epidemic. These are purely for my own use, to learn Python and data visualisation. I have changed the dates in row one to be Day1, Day2, etc., dropped the Province/State, Lat and Long columns, and set the Country/Region column as the index. Each dataset now has 107 columns and 267 rows. The three datasets are cases, deaths and recover.
Things are going OK, but I have a slight problem and need some advice. The files gain a new column each day, and this causes me problems when I try to write code to show the daily increase in numbers from today over yesterday. Currently I have to manually update my code each day to compensate for the extra column in the three CSV files, as my code looks like this:
daily_increase_C = [(0,
(cases["Day1"].sum()- (0)),
(cases["Day2"].sum()-cases["Day1"].sum()),
(cases["Day3"].sum()-cases["Day2"].sum()),
(cases["Day4"].sum()-cases["Day3"].sum()),
(cases["Day5"].sum()-cases["Day4"].sum()),
# ... (same pattern repeated for Day6 through Day101) ...
(cases["Day102"].sum()-cases["Day101"].sum()),
(cases["Day103"].sum()-cases["Day102"].sum()),
(cases["Day104"].sum()-cases["Day104"].sum()),
(cases["Day106"].sum()-cases["Day105"].sum()))]
So the last line has to be copied, pasted and then updated each day. There has to be a better way of achieving this, but, new to coding as I am, I cannot seem to get my head around it and figure it out.
Any advice, pointers or help on how to look at and approach this problem would be greatly appreciated. I hope I have explained this clearly enough; if not, my apologies, and please post any questions needing clarification. Thanks in advance for any help.
Hi, I am trying to use AI Builder to scan some titles and populate a spreadsheet once the PDF is dropped into a folder.
It works fine when it finds all the data, but if it cannot find any data in the columns starting with "SOL", then it does not bring anything through. I would like it to still bring through any data from the first three columns even if nothing is found for the "SOL" columns. Can anyone help, please?
Example output as needed. Currently row 3 will not come through.
I have tried some conditions and a Compose action.
Maybe you can also post your message in the Power Automate community.
Recently I obtained a lot of RTF files containing economic data I need to analyse. Unfortunately, this is the only way the Bureau of Statistics in my country could provide time series data covering a long period. If it were a one-time need to pick out a particular indicator for 10 years or so, I would be fine finding those values by hand using Word/Notepad/TextEdit (on a Mac). But my problem is that I have 15 files of data that I need to combine somehow into one dataset for my work. Before I even start, though, I don't have a clue whether it is possible to read those files into an appropriate format (a data.frame). I wanted to ask for expert opinion on how to approach this task. An example of one of the files can be downloaded from here:
https://www.dropbox.com/s/863ikx6poid8unc/Export_for_SO.rtf?dl=0
All text values are in Russian. The dataset represents exports of a particular product (first column) to various countries (second column), in US dollars, for two periods.
Thank you.
Use the code found at https://datascienceplus.com/how-to-import-multiple-csv-files-simultaneously-in-r-and-create-a-data-frame/ and replace read_csv with read_rtf (from the striprtf package mentioned below).
You may want to manually convert your files to another format using an office suite or a text editor; you should be able to use "Save As" to export them in another format.
Staying within R, you may want to give striprtf a try. I'm guessing you will still have to clean your data a bit afterward.
You can install the package like this:
install.packages("striprtf")
I have not worked with SPSS (.sav) files before and am trying to work with some data files provided to me by importing them into R. I did not receive any explanation of the files, and because communication is difficult I am trying to figure out as much as I can on my own.
Here's my first question. This is what the Date field looks like in an R data frame after import:
> dataset2$Date[1:4]
[1] 13608172800 13608259200 13608345600 13608345600
I don't know what dates the data is supposed to be for, but I found that if I divide the above numbers by 10, that seems to give a reasonable date (in February 2013). Can anyone confirm this is indeed what the above represents?
My second question is regarding another column called Begin_time. Here's what that looks like:
> dataset2$Begin_time[1:4]
[1] 29520 61800 21480 55080
Any idea what this represents? I want to believe it is some representation of the time of day, because the records are wildlife observations, but I haven't got any more information than that to go on. I noticed that if I take the difference between End_Time and Begin_time I get numbers like 120 and 180, which seems like minutes to me (two or three hours seems a reasonable time to observe a wild animal), but the absolute numbers are far greater than the number of minutes in a day (1440), so that leaves me puzzled. Is this some timekeeping format from SPSS? If so, what is the logic?
Unfortunately, I don't have access to SPSS, so any help would be much appreciated.
I had the same problem and this function is a good solution:
pss2date <- function(x) as.Date(x/86400, origin = "1582-10-14")
This is where I found the answer:
http://scs.math.yorku.ca/index.php/R:_Importing_dates_from_SPSS
Dates in SPSS Statistics are represented as floating-point doubles holding the number of seconds since October 14, 1582. If you use the SPSS R plugin APIs, they can be converted to R dates automatically, but any proper converter should be able to do this for you.
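To illustrate, here is a small sketch using the pss2date function above, together with a guess (not confirmed anywhere in the question) that Begin_time and End_Time are SPSS time-of-day values, i.e. seconds since midnight:
pss2date <- function(x) as.Date(x / 86400, origin = "1582-10-14")
# Date: seconds since 1582-10-14 converted to an R Date
dataset2$Date_r <- pss2date(dataset2$Date)
# Begin_time: if it really is seconds since midnight, 29520 becomes "08:12:00"
format(as.POSIXct(dataset2$Begin_time, origin = "1970-01-01", tz = "UTC"), "%H:%M:%S")
If that reading is right, End_Time minus Begin_time values of 120 and 180 would mean two to three minutes of observation rather than hours, which is worth sanity-checking against any records whose timing you do know.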
I'm new to stats, R, and programming in general, having only had a short course before being thrown in at the deep end. I am keen to work things out for myself, however.
My first task is to check the data I have been given for anomalies. I have been given a spreadsheet with columns Date, PersonID and PlaceID. I assumed that if I plotted each factor of PersonID against Date, a straight line would show there were no anomalies, since a PersonID should only be able to exist in one place at one time. However, I am concerned that if the same PersonID appears twice on one Date, my plot has no way of showing this.
I used the simple code:
require(ggplot2)
qplot(Date, PersonID)
My issue is that I am unsure how to factor the Date into this problem. Essentially, I am trying to check that no PersonID appears in more than one PlaceID on the same Date, and having been trying for two days, I cannot figure out how to put all three of these variables on the same plot.
I am not asking for someone to write the code for me. I just want to know if I am on the right train of thought, and if so, how I should think about asking R to plot this. Can anybody help me? Apologies if this question is rather long-winded, or posted in the wrong place.
If all you want to know is whether this occurs in the dataset try duplicated(). For example, assuming your dataframe is called df:
sum(duplicated(df[,c("Date","PersonID")]))
will return the number of duplicates based on the columns Date and PersonID in the dataframe. If it's greater than zero, you have duplicates in the data.
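As a small sketch going one step further (using the df, Date, PersonID and PlaceID names from the question), you could pull out the offending rows rather than just counting them, or check directly whether any person is recorded in more than one place on the same date:
# rows whose Date/PersonID combination occurs more than once, including the first occurrence
dups <- duplicated(df[, c("Date", "PersonID")]) |
        duplicated(df[, c("Date", "PersonID")], fromLast = TRUE)
df[dups, ]
# number of distinct PlaceID values per Date/PersonID; anything above 1 is an anomaly
places <- aggregate(PlaceID ~ Date + PersonID, data = df, FUN = function(x) length(unique(x)))
places[places$PlaceID > 1, ]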
I want to read in a large .ido file that has just under 110,000,000 rows and 8 columns. The columns are made up of 2 integer columns and 6 logical columns, and "|" is used as the delimiter. I tried read.big.matrix and it took forever. I also tried dumpDf and it ran out of RAM. I tried ff, which I heard was a good package, but I am struggling with errors. I would like to do some analysis with this table if I can read it in some way. If anyone has any suggestions, that would be great.
Kind Regards,
Lorcan
Thank you for all your suggestions. I managed to figure out why I couldn't get the code to work. I'll share the answer and some suggestions so no one else makes my stupid mistake.
First of all, the data that was given to me contained some errors, so I was doomed to fail from the start. I was unaware of this until a colleague came across it in another piece of software. A column that was supposed to contain integers had some letters in it, so when the ff package's read.table.ffdf function tried to read in the data set it got confused. In any case, I was given another sample of data, 16,000,000 rows and 8 columns with correct entries, and it worked perfectly. The code that I ran is as follows and took about 30 seconds to read:
setwd("D:/data test")
library(ff)
ffdf1 <- read.table.ffdf(file = "test.ido", header = TRUE, sep = "|")
Thank you all for your time, and if you have any questions about the answer, feel free to ask and I will do my best to help.
Do you really need all the data for your analysis? Maybe you could aggregate your dataset (say, from minute values to daily averages). This aggregation only needs to be done once, and can hopefully be done in chunks. That way you do not need to load all your data into memory at once.
Reading in chunks can be done using scan; the important arguments are skip and n. Alternatively, put your data into a database and extract the chunks that way. You could even use the functions from the plyr package to run chunks in parallel; see this blog post of mine for an example.
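For what it's worth, here is a rough sketch of the chunked read-and-aggregate idea, using read.table's skip and nrows arguments rather than scan (the file name and separator are taken from the question above; the column positions used in the aggregation are just placeholders):
chunk_size <- 1e6                 # rows per chunk; tune to your available RAM
skip <- 1                         # skip the header line, if there is one
summaries <- list()
repeat {
  chunk <- tryCatch(
    read.table("test.ido", sep = "|", header = FALSE, skip = skip,
               nrows = chunk_size, stringsAsFactors = FALSE),
    error = function(e) NULL      # read.table errors once nothing is left to read
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # reduce each chunk to something small, e.g. the sum of column 2 by column 1
  summaries[[length(summaries) + 1]] <- aggregate(V2 ~ V1, data = chunk, FUN = sum)
  skip <- skip + nrow(chunk)
}
# combine the per-chunk summaries and aggregate once more
result <- aggregate(V2 ~ V1, data = do.call(rbind, summaries), FUN = sum)
Note that each pass with skip rescans the file from the top, so for 110 million rows an open file connection (or a database, as suggested above) will be considerably faster; the sketch is only meant to show the shape of the loop.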