Group by with data.table using sum - r

I have a data frame that I want to group by user and find the sum of quantity.
library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')
dt = data.table(x)
colnames(dt)
"dates_d" "user" "proj" "quantity"
the column quantity is like this:
quantity
1
34
12
13
3
12
-
11
1
I have heard that the data.table library is very fast, so I would like to use it.
I have done this in Python but don't know how to do it in R.

For historical reasons (memory used to be scarce), R versions before 4.0.0 read strings in as factors by default. When a column contains even one non-numeric entry, such as the "-" in quantity, the whole column is read as character and then converted to a factor. Now that RAM is more easily available, you can pass stringsAsFactors=FALSE so that the column stays a character vector rather than becoming a factor.
Then use as.numeric to convert it into real-valued numbers before summing. Strings that cannot be converted into numbers become NA (with a warning), and na.rm=TRUE ignores those NAs in the sum.
Taking all of the above:
library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)
setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
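To see the conversion step on its own, here is a minimal base-R sketch. The values are made up to mimic the quantity column from the question, including the "-" placeholder; q_num and total are just illustrative names.

```r
# Made-up values mimicking the quantity column, including the "-" placeholder
quantity <- c("1", "34", "12", "-", "11")

# "-" cannot be parsed as a number, so as.numeric() turns it into NA
# (R also emits a coercion warning, silenced here for readability)
q_num <- suppressWarnings(as.numeric(quantity))

# na.rm = TRUE drops the NA before summing
total <- sum(q_num, na.rm = TRUE)
print(total)
```

Without na.rm = TRUE the sum itself would be NA, which is usually the first symptom people notice.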
Reference:
a useful comment from phiver in Is there any good reason for columns to be characters instead of factors?
linking to a blog by Roger Peng:
https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

library(dplyr)
dt[dt == "-" ] = NA
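# sum(!is.na(quantity)) would only count the non-missing rows, so convert to numeric and sum instead
df <- dt %>% group_by(user) %>%
  summarise(qty = sum(as.numeric(as.character(quantity)), na.rm = TRUE))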

Related

R: How to Count Rows with Subsetted Date in Date Formatted Column

I have about 30,000 rows of data with a Date column in date format. I would like to be able to count the number of rows by month/year and year, but when I aggregate with the below code, I get a vector within the data table for my results instead of a number.
Using the hyperlinked csv file, I have tried the aggregate function.
https://www.dropbox.com/s/a26t1gvbqaznjy0/myfiles.csv?dl=0
short.date <- strftime(myfiles$Date, "%Y/%m")
aggr.stat <- aggregate(myfiles$Date ~ short.date, FUN = count)
Below is a view of the aggr.stat data frame. There are two columns and the second one beginning with "c(" is the one where I'd like to see a count value.
1 1969/01 c(-365, -358, -351, -347, -346)
2 1969/02 c(-323, -320)
3 1969/03 c(-306, -292, -290)
4 1969/04 c(-275, -272, -271, -269, -261, -255)
5 1969/05 c(-245, -240, -231)
6 1969/06 c(-214, -211, -210, -205, -204, -201, -200, -194, -190, -186)
I'm not much into downloading any unknown file from the internet, so you'll have to adapt my proposed solution to your needs.
You can solve the problem with the help of data.table and lubridate.
Imagine your data has at least one column, called dates, holding actual dates (that is, calling class(df$dates) returns Date or something similar, such as POSIXct).
# load libraries
library(data.table)
library(lubridate)
# convert df to a data.table
setDT(df)
# count rows per month
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
.N counts the number of rows, by = groups the data. See ?data.table for further details.
Consider running everything from data frames. Specifically, add needed month/year column to data frame and then run aggregate using data argument (instead of running by separate vectors). Finally, there is no count() function in base R, use length instead:
# NEW COLUMN
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
# AGGREGATE WITH SPECIFIED DATA
aggr.stat <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
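As a quick check of the aggregate approach, here is a small sketch on made-up dates (the five dates and the year column are assumptions for illustration, not taken from the linked file). It uses format(), which for Date columns does the same job as strftime().

```r
# Made-up dates standing in for myfiles$Date
myfiles <- data.frame(
  Date = as.Date(c("1969-01-01", "1969-01-08", "1969-02-12",
                   "1970-03-01", "1970-03-15"))
)

# Grouping keys: year/month and year only
myfiles$short.date <- format(myfiles$Date, "%Y/%m")
myfiles$year <- format(myfiles$Date, "%Y")

# length() counts the rows in each group
by_month <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
by_year <- aggregate(Date ~ year, data = myfiles, FUN = length)

print(by_month)
print(by_year)
```

The Date column of each result now holds the row count per group rather than a vector of dates.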

How to separate a column of data.table given conditions

After reading some XML files, I need to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how I should separate the single column (see the code and results) into many columns according to given criteria.
In my opinion, we need either a loop with a step, or a special function, but I do not know which function exactly :/
stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames=TRUE, check.names=TRUE, key=NULL,
stringsAsFactors=TRUE)
for (i in seq(from = 1, to = 1375, by = 11)){
if (is.numeric(DT[i,stage3] = FALSE)){
DT$Name <- DT[i,stage3]
}
}
https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg
This is an example of the first 20 rows of 1375.
Here is how the data.table looks now. What I need is to separate these results into columns "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), the scores for problems 1-8 (8 columns, respectively), and the medal. The loop I have written is, I think, something that should extract the value from the existing column with a step of 11 (since every name, country, etc. repeats every 11 rows) and transfer it into a new one. Unfortunately, it doesn't work :/
Thanks in advance for your help!
Give this a shot.
First, load the required packages:
library (data.table)
library (stringr) # this is just for the piping operator %>%
You would read in your own data table here, I am creating one as an example:
dat = c( "Sergey","USSR",1,2,3,4,5,6,7,8,"silver") %>% rep (125) %>% data.table
setnames (dat, "stage3")
As a quick note, I would not read in your strings as factors as you do in your own code, because that can break the conversion to numeric.
The labels below are repeated to fill out the table. This only works if your table doesn't skip values. Also, it is not advisable to have numbers as column names; better to give them proper names like "test1", "test2", etc.:
dat [, metadata := rep (c ("name","country",1:8,"medal"), length.out = .N) ] # whatever you want to name your future 11 columns; recent data.table versions need the explicit rep() to recycle
dat [, participant := 1: (.N / 11) %>% rep (each = 11) ] # same idea, can't have missing rows
Now, reshape and convert from strings to numeric where possible:
new.dat =
dcast (dat, participant ~ metadata, value.var = "stage3") [, lapply (.SD, type.convert, as.is = TRUE) ]
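If you would rather stay in base R, the same "every 11 rows is one participant" idea can be sketched with matrix(..., byrow = TRUE). This is a different technique from the dcast approach above, not a rewrite of it; the three repeated participants and the column names p1..p8 are made up for illustration, and it has the same limitation that no rows may be missing.

```r
# Made-up stand-in for the stage3 vector: 3 participants x 11 fields each
stage3 <- rep(c("Sergey", "USSR", 1:8, "silver"), 3)

# Fill a matrix row by row, so each row holds one participant's 11 values
wide <- as.data.frame(matrix(stage3, ncol = 11, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- c("name", "country", paste0("p", 1:8), "medal")

# Convert columns that look numeric; as.is = TRUE keeps text as character
wide[] <- lapply(wide, type.convert, as.is = TRUE)
str(wide)
```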

Get column names and frequency from a table that looks like a matrix

I have a data frame that looks like:
'Part Number' 'Person Working'
'A' 'James'
'B' 'Brian'
'A' 'Andrea'
'C' 'Tiffany'
and so on for thousands of rows. The same part can have multiple people assigned to it. I'm pretty bad at summarizing data in R, but I'm able to produce (in the console) a table that looks like a frequency matrix by typing:
table(df$partnumber, df$personworking)
and it spits out unique items as rows, and every person working's name as a column. The values are a 0 or a 1 depending on if they are working that part.
What I'm looking for is a way to summarize this information in a digestible format that says, per item:
Part Number  NumWorkers  Names
A            2           "James, Andrea"
B            1           "Brian"
C            1           "Tiffany"
I'm also struggling with getting my table into a data frame. I've tried:
thedataframe <- data.frame(thetable[,])
but I'm not getting very far. I want to sum the amount of people working each unique part, and concat and print each column name that has a one as a value for a given part.
What is the best way to summarize this data in Base R?
Here is a method you could use in base R with aggregate:
dfAgg <- do.call(data.frame,
aggregate(df$Person, list(df$Parts),
FUN=function(x) c(length(x), paste(x, collapse=", "))))
# add nicer names
names(dfAgg) <- c("Parts", "Count", "Person")
Aggregate allows you to run a function over groups. In this instance, we are running a function that returns both the count of individuals (via length) and their names (via paste).
Here is the sample data I used to test this.
data
set.seed(1234)
df <- data.frame("Parts"=sample(LETTERS[1:3], 10, replace=T),
"Person"=sample(c("James", "Brian", "Sam", "Tiff", "Sandy"),
10, replace=T), stringsAsFactors=F)
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'partnumber', get the number of rows (.N) and paste the 'personworking' in each 'partnumber'.
library(data.table)
setDT(df)[,.(NumWorkers = .N, Names = toString(personworking)) , by = partnumber]
or we could use dplyr
library(dplyr)
df %>%
group_by(partnumber) %>%
summarise(NumWorkers = n(), Names = toString(personworking))
Or using base R
do.call(rbind, by(df, df$partnumber, FUN = function(x)
data.frame(NumWorkers = length(x$personworking), Names = toString(x$personworking))))
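For reference, here is another base-R sketch, run on the four sample rows from the question (the column names partnumber and personworking follow the question's table() call; result, num_workers and worker_names are illustrative names). It uses tapply rather than aggregate or by.

```r
# The four sample rows from the question
df <- data.frame(
  partnumber = c("A", "B", "A", "C"),
  personworking = c("James", "Brian", "Andrea", "Tiffany"),
  stringsAsFactors = FALSE
)

# One value per part: how many people, and who
num_workers <- tapply(df$personworking, df$partnumber, length)
worker_names <- tapply(df$personworking, df$partnumber, toString)

result <- data.frame(
  PartNumber = names(num_workers),
  NumWorkers = as.vector(num_workers),
  Names = as.vector(worker_names),
  stringsAsFactors = FALSE
)
print(result)
```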

How to write loop for creating multiple variables in R

I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <-data$sleep64 + data$wakeup64,data$total94<-data$sleep94+data$wakeup94.
Without a loop, I need to write this code 94 times. I hope someone could give me some tips on this. It doesn't have to be a loop, but an easier way to do this.
FYI, all variables are numeric and have about 30% missing values. The missing values are random; they could be anywhere. A missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
# rep(..., each = 94) gives sleep1..sleep94 followed by wakeup1..wakeup94
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
              paste0(rep(c("sleep", "wakeup"), each = 94), rep(1:94, 2)))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add it together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (not clear from the OP's post). In that case, we may need to convert the dataset to numeric class so that the '' will be automatically converted to NA
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
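To check the NA handling on something small, here is a sketch with two days instead of 94 (the values are made up). Note that replacing NA with 0 means a total is still reported when only one of the two parts is known, which may or may not be what you want.

```r
# Made-up data: two days, two subjects, some values missing
data <- data.frame(sleep1 = c(7, NA), sleep2 = c(6, 8),
                   wakeup1 = c(1, 2), wakeup2 = c(NA, 1))

# Pick the sleep and wakeup columns by name
sleepDat <- data[grep("sleep", names(data))]
wakeDat <- data[grep("wakeup", names(data))]

# Treat NA as 0, then add the two blocks column by column
nm1 <- paste0("total", 1:2)
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
  replace(wakeDat, is.na(wakeDat), 0)
print(data)
```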

Apply function to each column in a data frame observing each columns existing data type

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
If it were an "ordered factor" things would be different. Which is not to say I like "ordered factors", I don't, only to say that some relationships are defined for 'ordered factors' that are not defined for "factors". Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors which is alphabetical lexical order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or if you want to test for factors first and return as you expect then:
sapply( df, function(x) if("factor" %in% class(x) ) {
max(as.numeric(as.character(x)))
} else { max(x) } )
@Darren's comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
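A small sketch of that coercion, on a made-up two-column data frame (d, m and col_max are illustrative names): as.matrix() is essentially what apply() does to its input first, while lapply() hands each column to the function with its own type intact.

```r
# Made-up data frame with one character and one numeric column
d <- data.frame(name = c("ant", "zebra"), value = c(-99.5, 3),
                stringsAsFactors = FALSE)

# apply() works on a matrix, and a matrix holds a single type,
# so the numeric column is turned into text
m <- as.matrix(d)
print(is.character(m))

# lapply() keeps each column's own type, so max() behaves per column
col_max <- lapply(d, max)
str(col_max)
```

The price is that lapply() returns a list rather than a vector, precisely because the per-column results have different types.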
The default behavior when you create a data frame is for categorical columns to be stored as factors. Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R is assuming that you've created an unordered factor.
You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
d[4,] <- NA
#Similar function to DWin's answer
fun <- function(x){
if(is.numeric(x)){max(x,na.rm = 1)}
else{max(as.character(x),na.rm=1)}
}
#Use colwise from plyr package
library(plyr)
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of numeric columns, and the frequencies of the top levels of factor columns.
The best way to do this is to avoid apply(), which coerces the entire data frame to a matrix, possibly losing information.
If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.
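A minimal sketch of the base-R type.convert route, assuming a data frame whose columns were all read in as character (raw and typed are made-up names). Assigning into typed[] keeps the data frame structure while replacing the columns.

```r
# Made-up data frame where everything arrived as character
raw <- data.frame(id = c("1", "2"), score = c("3.5", "4"),
                  label = c("a", "b"), stringsAsFactors = FALSE)

# Convert column by column; as.is = TRUE keeps text as character
# instead of turning it into a factor
typed <- raw
typed[] <- lapply(raw, type.convert, as.is = TRUE)
print(sapply(typed, class))
```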
Less interesting answer: we can apply it to each column with a for-loop (parse_guess is from readr and expects character columns):
for (i in seq_along(t)) { t[[i]] <- parse_guess(t[[i]]) }
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
Building on @ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
These days loops are just as fast, so this is more than sufficient:
# build the example data frame once, then overwrite each column with its maximum
df <- data.frame(a = c("1","2","3"), b = c("1","3","3"))
for (i in seq_along(df)) {
  df[, i] <- max(as.numeric(as.character(df[, i])))
}
A solution using retype() from hablar to coerce factors to character or numeric type, depending on feasibility. I'd use dplyr to apply max to each column.
Code
library(dplyr)
library(hablar)
# retype() simplifies each column's type, e.g. always removes factors
d <- d %>% retype()
# Check max for each column
d %>% summarise_all(max)
Result
Note the new column types.
v1 v2 v3 v4
<dbl> <chr> <dbl> <chr>
1 0.974 j 1.09 J
Data
# Sample data borrowed from @joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df
my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))
