Import data from a subset of subjects in R

I am working with a data set of 300 million rows in total, split over 5 csv files. The data contains weight measurements of users over 5 years (one file per year). As calculations take ages on this massive data set, I would like to work with a subset of users while developing the code. I've used the nrows argument of read.csv to import only the first 50,000 lines of each file. However, one user may have 400 weight measurements in the file for 2014 but only 240 in 2015, so I don't get the same set of users from each file when I import a fixed number of rows. Is there a way to import the data of the first 1000 users in each file?
The data looks like this in all files:
user_ID date_local weight_kg
0002a3e897bd47a575a720b84aad6e01632d2069 2016-01-07 99.2
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-08 99.6
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-10 99.5
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-03-13 99.1
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-04-20 78.2
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-02 78.3
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-07 78.9
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-15 82.1
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-22 82.6
Thanks a lot in advance!

If you have grep on your system, you can combine it with pipe and read.csv to read only the rows that match a pattern. Using your example data, you could read only the users whose IDs start with 0001 or 0002 like this. You'll need to add the header back later, as the header line won't match the pattern.
mydata <- read.csv(pipe('grep "^000[12]" "mydata.csv"'),
                   colClasses = c("character", "Date", "numeric"),
                   header = FALSE)
I'm not sure what the pattern for your user_ID should be: your sample shows long hashes, but you state that you want the first 1000 users. If that meant IDs 0001 - 1000, a pattern for grep might be something like ^[01][0-9]{3}.
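If the IDs don't follow a usable pattern, another option is to collect the first 1000 distinct IDs from one file and filter every file on those. A minimal sketch with data.table (the file names weights_2012.csv ... weights_2016.csv are assumptions, not from the question):
library(data.table)

# Read only the ID column of one file, then keep the first 1000 unique IDs
ids <- fread("weights_2012.csv", select = "user_ID")
keep <- unique(ids$user_ID)[1:1000]

# Filter each yearly file down to those users and stack the results
files <- sprintf("weights_%d.csv", 2012:2016)
weights <- rbindlist(lapply(files, function(f) fread(f)[user_ID %in% keep]))
Note that the 1000 users kept this way are the first ones appearing in the 2012 file; users who only appear in later files won't be included.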

Related

Removing Duplicates From a Dataframe in R

My situation is that I am trying to clean up a data set of student results for processing and I'm having some issues with completely removing duplicates as only wanting to look at "first attempts" but some students have taken the course multiple times. An example of the data using one of the duplicates is:
id period desc
632 1507 1101 90714 Research a contemporary biological issue
633 1507 1101 6317 Explain the process of speciation
634 1507 1101 8931 Describe gene expression
14448 1507 1201 8931 Describe gene expression
14449 1507 1201 6317 Explain the process of speciation
14450 1507 1201 90714 Research a contemporary biological issue
25884 1507 1301 6317 Explain the process of speciation
25885 1507 1301 8931 Describe gene expression
25886 1507 1301 90714 Research a contemporary biological issue
The first 2 digits of period are the year they sat the paper. As can be seen, I want to keep the rows where id is 1507 and period is 1101. So far, an example of my code to get the values I want to trim is:
unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])
However, I am running into a couple of problems. This only works because the data is ordered by id and period, and that isn't guaranteed in future. Also, I don't know how to take this list of duplicate entries and select the rows that are not in it: %in% doesn't seem to work with it, and a loop with rbind runs out of memory.
What's the best way to handle this?
I would probably use dplyr. Calling your data df:
result = df %>%
  group_by(id) %>%
  filter(period == min(period))
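This keeps every row from each id's earliest period (i.e. all the desc rows belonging to the first attempt), and it does not depend on the ordering of the input.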
If you prefer base, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:
id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
result = merge(df, id_pd)
Try this, it works for me with your data:
dd <- read.csv("a.csv", colClasses = c("numeric", "numeric", "character"), header = TRUE)
print(dd)
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id", "period")]), ]
print(dd)
Output:
id period desc
1 1507 1101 90714 Research a contemporary biological issue
4 1507 1201 8931 Describe gene expression
7 1507 1301 6317 Explain the process of speciation

Function to identify changes done previously

BACKGROUND
I have a list of 16 data frames. One of them looks like this, and all the others have a similar format. The DateTime column is of class Date, while the Value column is a time series.
> head(train_data[[1]])
DateTime Value
739 2009-07-31 49.9
740 2009-08-31 53.5
741 2009-09-30 54.4
742 2009-10-31 56.0
743 2009-11-30 54.4
744 2009-12-31 55.3
I am performing forecasting on the Value column across all the data frames in this list. The following line of code feeds the data into the UCM model.
train_dataucm <- lapply(train_data, transform, Value = ifelse(Value > 50000, Value/100000, Value))
The transform step is used to scale down large values because UCM has some issues with large values (I don't know why though); I learned that from user #KRC in this link.
One data frame was affected because it had large values, which got scaled down. All the other data frames remained unaffected.
> head(train_data[[5]])
DateTime Value
715 2009-07-31 139901
716 2009-08-31 139492
717 2009-09-30 138818
718 2009-10-31 138432
719 2009-11-30 138659
720 2009-12-31 138013
I only got to know this by manually checking each of the 16 data frames.
PROBLEM
Is there any function which can call out the data frames that were affected by the condition I inserted?
The function must be able to list the affected data frames and put them into a list.
If I can do this, I can then reverse the transformation on the values and get the actual values back.
This way I can give correct forecasts with minimal human intervention.
I hope I am clear in specifying the problem .
Thank You.
Simply check whether any of your values in a data frame is too high:
has_too_high_values = function(df)
  any(df$Value > 50000)
And then collect them, e.g. using Filter:
Filter(has_too_high_values, train_data)
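To get the names of the affected frames, or a reusable logical index for reversing the scaling afterwards, a small extension of the same idea (a sketch using the objects from the question; it assumes every Value in an affected frame was scaled):
# Names of the affected data frames
names(Filter(has_too_high_values, train_data))

# Logical index over the list, reusable to undo the Value/100000 scaling
idx <- sapply(train_data, has_too_high_values)
train_dataucm[idx] <- lapply(train_dataucm[idx], transform, Value = Value * 100000)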

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive my poor formatting in separating the datasets; Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
Between the above data and what follows, I've deleted two columns and reassigned an observation as a column title.
[Chart: Hikaru Nakamura / Magnus Carlsen's chess ratings over time]
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want is a dataframe of the two combined, merged by Year.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having very annoying difficulties adding more players to this combined data set.
For example, I might like to add 5, 10, or 15 more players to this set, such as Kramnik, Anand, and Gelfand (famous chess players). As you'd expect, the dataframe would have 6 columns for 5 players, 11 for 10, and 16 for 15, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
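As an aside (not from the original post): split() builds the same per-player data frames as a named list in one step, avoiding the assign()/mget() round trip used below.
player_list <- split(Timed_set_filtered, Timed_set_filtered$Name)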
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern = '^Player\\d+'))
list2env(lapply(lst, `[`, -2), envir = .GlobalEnv)
lst <- mget(ls(pattern = '^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  names(lst[[i]])[names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to write a table merged by Year, so that it [cbinds, bind_cols, merges, etc.] each of the Player i dataframes in my list, which are not necessarily equal in length, in such a way that I get a combined/merged set like the merge(Player1, Player2) result shown earlier?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list function, and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of Player1, Player2, ..., Player i into the list, I just joined all of the sets contained in the list by Year.
The for loop that generates all of the unique datasets:
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Put them into a list:
lst <- mget(ls(pattern = '^Player\\d+'))
Merge, or join, by the common Year column (join_all is from the plyr package):
library(plyr)
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(..., all = TRUE), this drops certain observations: join_all defaults to a left join, so years missing from the first data frame in the list are lost.
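A full join should keep every year, matching the behavior of merge with all = TRUE (a sketch of the presumed fix, using join_all's type argument):
df <- join_all(lst, by = 'Year', type = 'full')
Equivalently, in base R, Reduce can fold merge over the whole list:
merged <- Reduce(function(x, y) merge(x, y, by = 'Year', all = TRUE), lst)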

Summing values for a month in R

Please see a data sample below:
3326 2015-03-03 Wm Eu Apple 2L 60
3327 2015-03-03 Tp Euro 2 Layer 420
3328 2015-03-03 Tpe 3-Layer 80
3329 2015-03-03 14/3 Bgs 145
3330 2015-03-04 T/P 196
3331 2015-03-04 Wm Eu Apple 2L 1,260
3332 2015-03-04 Tp Euro 2 Layer 360
3333 2015-03-04 14/3 Bgs 1,355
Currently, graphing this data creates a really horrible graph because the number of cartons changes so rapidly from day to day. It would make more sense to sum the cartons by month, so that each data point represents that month's total rather than an individual day. The data ranges from 11/01/2008 to 04/01/2015.
This is the code that I am using to graph (which may or may not be relevant for this):
ggvis(myfile, ~Shipment.Date, ~ctns) %>%
layer_lines()
Shipment.Date is column 2 in the data set and ctns is the 4th column.
I don't know much about R and have given it a few tries with some code I found here, but I don't think I've found a similar enough problem to adapt the code from. My idea is to create a new table with Act. Ctns summed by month, and then graph from there.
Thanks for any assistance! :)
Do you need this:
data.aggregated <- aggregate(list(new.value = data$value),
                             by = list(date.time = cut(data$date.time, breaks = "1 month")),
                             FUN = sum)
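Adapted to the question's column names (a sketch: it assumes Shipment.Date is already of class Date and ctns is numeric), this becomes:
# Sum cartons per calendar month
myfile$month <- as.Date(cut(myfile$Shipment.Date, breaks = "1 month"))
monthly <- aggregate(list(ctns = myfile$ctns),
                     by = list(month = myfile$month),
                     FUN = sum)

# Plot the monthly totals instead of the daily values
library(ggvis)
ggvis(monthly, ~month, ~ctns) %>%
  layer_lines()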

Merging in R based on dates

I'm using getSymbols to import stock data from Yahoo to R.
When I store it in a data frame, it's in the following format.
IDEA.BO.Open IDEA.BO.High IDEA.BO.Low IDEA.BO.Close IDEA.BO.Volume
2007-03-09 92.40 94.25 84.00 85.55 63599400
2007-03-12 85.55 89.95 85.55 87.40 12490900
2007-03-13 88.50 91.25 86.20 89.85 16785000
2007-03-14 87.05 90.85 86.60 87.75 7763800
2007-03-15 90.00 94.00 88.80 91.45 14808200
2007-03-16 92.40 93.65 91.25 92.40 6365600
Now the date column has no name.
I want to import data for 2 stocks and merge their closing prices (over any given range of rows) on the basis of dates. The problem is that the date column is not being recognized.
I want my final result to be like this.
IDEA.BO.Close BHARTIARTL.BO.Close
2007-03-12 123 333
2007-03-13 456 645
2007-03-14 789 999
I tried the following:
> c <- merge(Cl(IDEA.BO),Cl(BHARTIARTL.BO))
> c['2013-08/']
IDEA.BO.Close BHARTIARTL.BO.Close
2013-08-06 NA 323.40
2013-08-07 NA 326.80
2013-08-08 157.90 337.40
2013-08-09 157.90 337.40
The same data on excel looks like this:
8/6/2013 156.75 8/6/2013 323.4
8/7/2013 153.1 8/7/2013 326.8
8/8/2013 157.9 8/8/2013 337.4
8/9/2013 157.9 8/9/2013 337.4
I don't understand the reason behind the NA values in R, or how to obtain merged data free of NA values.
You need to do more reading about xts and zoo data structures. They are matrices with ordered time indices. When you convert them to data.frames, they become lists with a 'rownames' attribute, which print.data.frame displays with no header; the list elements are given names based on the naming of the matrix columns. (I do understand Joshua's visible annoyance at this question, since he has posted many SO examples of how to use xts objects.)
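Keeping the objects as xts instead of converting to data.frame, the NA rows come from merge.xts defaulting to an outer join, which fills NA for dates present in only one series. An inner join, or na.omit, keeps only the shared dates. A minimal sketch, assuming IDEA.BO and BHARTIARTL.BO are the xts objects returned by getSymbols:
library(quantmod)

# Inner join: keep only dates present in both series
merged <- merge(Cl(IDEA.BO), Cl(BHARTIARTL.BO), join = "inner")
head(merged["2013-08/"])

# Alternatively, drop incomplete rows after an outer join
merged2 <- na.omit(merge(Cl(IDEA.BO), Cl(BHARTIARTL.BO)))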
