Missing values in truncated data frame - R

I have the following data frame (t) with 317,000 obs.
date | page | rank
2015-10-10 | url1 | 1
2015-10-10 | url2 | 2
2015-10-10 | url2 | 3
.
.
.
2015-10-10 | url1000 | 1000
2015-10-11 | url1 | 1
I'm trying to reshape this data because I want to know how many days a particular URL has stayed at rank 50 or better.
piv = reshape(t,direction = "wide", idvar = "page", timevar = "date")
If I do that I obtain a table with 27,447 obs and 318 columns, but it generates a lot of NAs. Example below (only 20 columns shown):
page id.2015-12-07 id.2015-12-08 id.2015-12-09 id.2015-12-10 id.2015-12-11 id.2015-12-12 id.2015-12-13
1 url1 1 1 1 1 1 2 2
id.2015-12-14 id.2015-12-15 id.2015-12-16 id.2015-12-17 id.2015-12-18 id.2015-12-19 id.2015-12-20 id.2015-12-21
1 1 1 1 1 106 534 NA 282
id.2015-12-22 id.2015-12-23 id.2015-12-24 id.2015-12-26
1 270 445 NA NA
Also, using cast I got the following error:
pivoted = cast(t,page ~ rank + date )
Using id as value column. Use the value argument to cast to override this choice
Error in `[.data.frame`(data, , variables, drop = FALSE) :
  undefined columns selected
I have 317 unique dates and 27,447 unique pages or URLs.

I suggest you use the dplyr package for this kind of task, if that is possible for you:
library(dplyr)
df %>%
  filter(rank <= 50) %>%
  group_by(page) %>%
  summarize(days_in_top_50 = n())
will give you the result you are looking for.
You have one row per page and day. The first line (filter) keeps only the rows where the rank was in the top 50. The second line (group_by) means you get results per page, and finally the n() in the third line counts, for each page, the rows that passed the filter.
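To make that concrete, here is a tiny made-up example (hypothetical pages and ranks, not your real data); the pipeline returns one row per page with the number of qualifying days:
library(dplyr)
# toy data: two dates, two pages (hypothetical)
t <- data.frame(
  date = as.Date(c("2015-10-10", "2015-10-10", "2015-10-11", "2015-10-11")),
  page = c("url1", "url2", "url1", "url2"),
  rank = c(10, 80, 45, 30)
)
t %>%
  filter(rank <= 50) %>%
  group_by(page) %>%
  summarize(days_in_top_50 = n())
#   page  days_in_top_50
# 1 url1               2
# 2 url2               1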
For more information you can check out https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Related

dcast - error: Aggregate function missing

A little background information regarding my question: I have run a trial with 2 different materials, using 2x2 settings. Each treatment was performed in duplicate, resulting in a total of 2x2x2x2 = 16 runs in my dataset. The dataset has the following headings, in which repetition is either 1 or 2 (as it was performed in duplicate).
| Run | Repetition | Material | Air speed | Class. speed | Parameter of interest |
I would like to transform this into a dataframe/table which has the following headings, resulting in 8 columns:
| Run | Material | Air speed | Class. speed | Parameter of interest from repetition 1 | Parameter of interest from repetition 2 |
This means that each treatment (combination of material, setting 1 and setting 2) is only shown once, and the parameter of interest is shown twice.
I have a dataset which looks as follows:
code rep material airspeed classifier_speed fine_fraction
1 L17 1 lupine 50 600 1
2 L19 2 lupine 50 600 6
3 L16 1 lupine 60 600 9
4 L22 2 lupine 60 600 12
5 L18 1 lupine 50 1200 4
6 L21 2 lupine 50 1200 6
I have melted it as follows:
melt1 <- melt(duplo_selection, id.vars = c("material", "airspeed", "classifier_speed", "rep"),
measure.vars=c("fine_fraction"))
and then tried to cast it as follows:
cast <- dcast(melt1, material + airspeed + classifier_speed ~ variable, value.var = "value")
This gives the following message:
Aggregate function missing, defaulting to 'length'
and produces a data frame in which the parameter of interest is counted rather than both values being shown.
Thanks for your effort and time trying to help me out; after a little puzzling I found out what I had to do.
I added a replicate column to each observation, being either a 1 or a 2, as the trial was performed in duplicate.
Via the code
cast <- dcast(duplo_selection, material + airspeed + classifier_speed ~ replicate, value.var = "fine_fraction")
I came to the 5 x 8 table I was looking for.
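For reference, a minimal sketch of that call on the six-row sample printed above (assuming reshape2's dcast and using the existing rep column as the replicate identifier) would be:
library(reshape2)
# sketch: duplo_selection is the sample data shown above; rep identifies the replicate
cast <- dcast(duplo_selection,
              material + airspeed + classifier_speed ~ rep,
              value.var = "fine_fraction")
cast
#   material airspeed classifier_speed 1  2
# 1   lupine       50              600 1  6
# 2   lupine       50             1200 4  6
# 3   lupine       60              600 9 12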

Fixing Column Issue When Importing Data in R

I'm currently having an issue importing a data set of tweets so that every observation ends up in one column.
This is the data before import; it has three lines for each tweet, with a blank line in between.
T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz
library(tidyverse)
tweets1 <- read_csv("tweets.txt.gz", col_names = F,
skip_empty_rows = F)
This is the output:
Parsed with column specification:
cols(
X1 = col_character()
)
Warning message:
“71299 parsing failures.
row col expected actual file
35 -- 1 columns 2 columns 'tweets.txt.gz'
43 -- 1 columns 2 columns 'tweets.txt.gz'
59 -- 1 columns 2 columns 'tweets.txt.gz'
71 -- 1 columns 5 columns 'tweets.txt.gz'
107 -- 1 columns 3 columns 'tweets.txt.gz'
... ... ......... ......... ...............
See problems(...) for more details.
”
# A tibble: 1,220,233 x 1
X1
<chr>
1 "T\t2009-06-11 00:00:03"
2 "U\thttp://twitter.com/imdb"
3 "W\tNo Post Title"
4 NA
5 "T\t2009-06-11 16:37:14"
6 "U\thttp://twitter.com/ncruralhealth"
7 "W\tNo Post Title"
8 NA
9 "T\t2009-06-11 16:56:23"
10 "U\thttp://twitter.com/boydjones"
# … with 1,220,223 more rows
The only issue is the many parsing failures, where problems(tweets1) shows that R expected one column but got multiple. Any ideas on how to fix this? My output should have 1.4 million rows according to my professor, so I'm unsure whether this parsing issue is the culprit. Any help is appreciated!
Maybe something like this will work for you.
data
data <- 'T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz'
For a large file, fread() should be quick. Setting sep = NULL basically says to just read in full lines. You would replace input = data with file = "tweets.txt.gz".
library(data.table)
read_rows <- fread(input = data, header = FALSE, sep = NULL, blank.lines.skip = TRUE)
processing
You could just stay with data.table, but I noticed you are in the tidyverse already.
library(dplyr)
library(stringr)
library(tidyr)
Basically I am grabbing the first character (T, U, W) and storing it into a variable called Column. I am adding another column called Content for the rest of the string, with white space trimmed on both ends. I also added an ID column so I know how to group the clusters of 3 rows.
Then you basically just pivot on the Column. I am not sure if you wanted this last step or not, so remove as needed.
read_rows %>%
  mutate(ID = rep(seq_len(n() / 3), each = 3),
         Column = str_sub(V1, 1, 1),
         Content = str_trim(str_sub(V1, 2))) %>%
  select(-V1) %>%
  pivot_wider(names_from = Column, values_from = Content)
result
# A tibble: 3 x 4
ID T U W
<int> <chr> <chr> <chr>
1 1 2009-06-11 00:00:03 http://twitter.com/imdb No Post Title
2 2 2009-06-11 16:37:14 http://twitter.com/ncruralhealth No Post Title
3 3 2009-06-11 16:56:23 http://twitter.com/boydjones "listening to \"Big Lizard - The Dead Milkmen\" ♫ http://blip.fm/~81kwz"
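If you would rather stay entirely in data.table, a rough equivalent (just a sketch, untested against the full 1.4 million-row file, and assuming the row count is an exact multiple of 3) might be:
library(data.table)
# label clusters of 3 lines, split the leading letter from the rest, then cast wide
read_rows[, `:=`(ID = rep(seq_len(.N / 3), each = 3),
                 Column = substr(V1, 1, 1),
                 Content = trimws(substring(V1, 2)))]
dcast(read_rows, ID ~ Column, value.var = "Content")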

Binning, grouping data in R using values within a specific range to determine an event or change for survival analysis

I have a dataframe as follows:
df <- data.frame(as.date=c("14/06/2016","15/06/2016","16/06/2016","17/06/2016","18/06/2016","19/06/2016","20/06/2016","21/06/2016","22/06/2016","23/06/2016",
"24/06/2016","04/07/2016","05/07/2016","06/07/2016","07/07/2016","08/07/2016","09/07/2016","10/07/2016","11/07/2016","12/07/2016",
"13/07/2016","14/07/2016","15/07/2016","17/07/2016","18/07/2016","19/07/2016","20/07/2016","21/07/2016","22/07/2016","01/08/2016",
"02/08/2016","03/08/2016","04/08/2016","05/08/2016","06/08/2016","07/08/2016","08/08/2016","09/08/2016","10/08/2016","11/08/2016",
"12/08/2016","13/08/2016","14/08/2016","15/08/2016","16/08/2016","17/08/2016","18/08/2016","19/08/2016","20/08/2016","21/08/2016",
"22/08/2016","23/08/2016","24/08/2016","25/08/2016","26/08/2016","27/08/2016","28/08/2016","29/08/2016","30/08/2016","31/08/2016",
"01/09/2016","02/09/2016","03/09/2016","04/09/2016","05/09/2016","06/09/2016","07/09/2016","08/09/2016","09/09/2016","10/09/2016",
"11/09/2016","12/09/2016","13/09/2016","14/09/2016","15/09/2016","16/09/2016","17/09/2016","18/09/2016","19/09/2016","20/09/2016"),
wear=c("0","55","0","0","0","0","8","8","15","25","30","37","43","49","52","52","55","57","57","61","67","69","2","2","7",
"10","13","14","16","16","19","22","22","24","25","26","29","29","33","34","34","36","38","44","45","48","50","55",
"56","58","0","4","0","4","4","6","9","9","12","14","16","17","25","25","33","36","44","46","48","52","55","59",
"8","9","9","12","24","33","36","44"))
The data is an example of the wear rate of a type of metal on a machine: it increases over time then drops to 0, indicating an event or a change. The problem I have is that the wear value doesn't always drop to 0, as you can see from the data. There are 2 variables:
as.date = date over time,
wear = wear of metal on a part over time
The ranges between changes are:
55-0,
69-2,
58-0,
59-8
When it drops from a large number to 0 it is easy to code. I use the following code to make the change and add new variables called Status & id:
# Creates 2 new columns, Status & id
prop.table(table(df$Status))
prop.table(table(df$Status), 1)
df$Status <- 0  # fill in column Status with all zeros
df$Status[df$wear > -10 & df$wear == 0] <- 1  # fill in 1s when wear = 0
prop.table(table(df$Status))
prop.table(table(df$Status), 1)
df$id <- 1  # fill in column id with 1s
for (i in 2:nrow(df)) {
  if (df$Status[i - 1] == 0) {
    df$id[i] <- df$id[i - 1]
  } else {
    df$id[i] <- df$id[i - 1] + 1
  }
}
It works OK to catch a drop in the wear values to 0, but not when there isn't one. As the data examples show, the drops take place from 55-0, 69-2, 58-0 and 59-8, and within the real data set there are occasions when the drop in wear values goes to a negative value. I'm not sure of the correct way to achieve this; I tried messing around with binning and grouping the data but was unsuccessful.
This is a sample of the data; in the real data set there are 100+ events, mostly a wear value dropping to 0, but on 10-20 occasions either dropping to negative values or to values < 10.
I think a for-loop is inefficient here. We can do something like this using the dplyr and lubridate packages.
library(dplyr)
library(lubridate)
df2 <- df %>%
  # Convert the as.date column to Date class
  # Convert the wear column to numeric
  mutate(as.date = dmy(as.date),
         wear = as.numeric(as.character(wear))) %>%
  # Create a column holding the wear of the previous record
  mutate(wear2 = lag(wear)) %>%
  mutate(Diff = wear - wear2)
The idea is to shift the wear column by one row and then calculate the difference between the wear on each date and the previous date. The result is saved in the new column Diff. Here is what the new data frame looks like.
head(df2)
# as.date wear wear2 Diff
# 1 2016-06-14 0 NA NA
# 2 2016-06-15 55 0 55
# 3 2016-06-16 0 55 -55
# 4 2016-06-17 0 0 0
# 5 2016-06-18 0 0 0
# 6 2016-06-19 0 0 0
After this, you can define a threshold on Diff to identify the end of a period. For example, here I defined the threshold to be -50. You can see that the filter successfully identifies four periods.
# Filter Diff <= -50
df2 %>% filter(Diff <= -50)
# as.date wear wear2 Diff
# 1 2016-06-16 0 55 -55
# 2 2016-07-15 2 69 -67
# 3 2016-08-22 0 58 -58
# 4 2016-09-13 8 59 -51
One final note: in your original data frame the wear column is a factor, but you treat it as numeric. This is dangerous. I used wear = as.numeric(as.character(wear)) to convert the column to numeric, but it would be better if you created a numeric column in the first place.
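If you also want an id column marking each wear period, similar to the one your loop builds, one possible sketch (using the same assumed threshold of -50) is to count the large drops cumulatively:
library(dplyr)
df3 <- df2 %>%
  mutate(event = !is.na(Diff) & Diff <= -50,  # TRUE on the row where a big drop occurs
         id = cumsum(event) + 1)              # period counter, starting at 1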

How to check for skipped values in a series in an R data frame column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe, price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so a total of 540 rows.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 data frames and you want to make sure the data frame "price1" has the correct row for each car name in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop over the rows of your frame
for (i in 1:nrow(price1)) {
  # check if the Price value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace Price and Rebate with the corresponding values in price2
    j <- which(price2$Name == price1$Name[i])
    price1$Price[i] <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete(), or you can even fudge it and right_join a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
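A rough sketch of the complete() route, assuming price1 and price2 have the column names shown in the question, might look like this:
library(dplyr)
library(tidyr)
price_full <- price1 %>%
  complete(Name, Week = 1:54) %>%            # one row per car name and week
  left_join(price2, by = "Name") %>%         # bring in the per-car averages
  mutate(Price = coalesce(Price, AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)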

Remove duplicate rows in R data frame, based on a date field and another field

New to R, but learning to handle db data and hit a wall.
I want to remove duplicate rows/observations from a table, based on two criteria: a user ID field and a date field that indicates the last time there was a change to the user, keeping only the most recently dated row for each user.
My truncated data set would look like the following:
UID | DateLastChange
1 | 01/01/2016
1 | 01/03/2016
2 | 01/14/2015
3 | 02/15/2014
3 | 03/15/2016
I would like to end up with:
UID | DateLastChange
1 | 01/03/2016
2 | 01/14/2015
3 | 03/15/2016
I have attempted to use duplicated and unique, but they don't seem to offer the selectivity I need. I can conceive of building a new table with unique UIDs, then left joining in some way to match only the most recent date.
Any advice would be much appreciated.
Scott
We can use data.table
library(data.table)
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y")), head(.SD, 1), by = UID]
# UID DateLastChange
#1: 1 01/03/2016
#2: 2 01/14/2015
#3: 3 03/15/2016
Or using duplicated
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y"))][!duplicated(UID)]
Using dplyr - data can be in any order
require(dplyr)
dat$DateLastChange <- as.Date(dat$DateLastChange, "%m/%d/%Y")
dat %>% group_by(UID) %>% summarize(DateLastChange = max(DateLastChange))
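If you need to keep every column of the winning row rather than just the date, a small variation (a sketch, assuming dplyr 1.0 or later for slice_max()) would be:
library(dplyr)
dat %>%
  group_by(UID) %>%
  slice_max(DateLastChange, n = 1, with_ties = FALSE) %>%  # latest row per UID
  ungroup()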
