How to separate a column of data.table given conditions

After reading some XML files, I need to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how I should separate the single column (see the code and results) into many columns according to the given criteria.
In my opinion, we either need a loop with a step, or a special function, but I do not know which function exactly :/
stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames = TRUE, check.names = TRUE, key = NULL,
                 stringsAsFactors = TRUE)
for (i in seq(from = 1, to = 1375, by = 11)) {
  if (is.numeric(DT[i, stage3]) == FALSE) {
    DT$Name <- DT[i, stage3]
  }
}
https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg
This is an example of the first 20 rows (out of 1375).
Here is how the data.table looks now. What I need is to separate these results into columns: "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), the scores for problems 1-8 (eight columns), and the medal. The loop I have written is, I think, supposed to step through the existing column in increments of 11 (since every name, country, etc. repeats every 11 rows) and transfer each value into a new column. Unfortunately, it doesn't work :/
Thanks in advance for your help!

Give this a shot.
First, load the required packages:
library(data.table)
library(stringr) # this is just for the piping operator %>%
You would read in your own data table here; I am creating one as an example:
dat = c( "Sergey","USSR",1,2,3,4,5,6,7,8,"silver") %>% rep (125) %>% data.table
setnames (dat, "stage3")
As a quick note, I would not read your strings in as factors as you do in your own code, because factors can mess up the conversion to numeric.
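For illustration, here is the classic factor pitfall on a toy vector (not the OP's data): as.numeric() on a factor returns the internal level codes rather than the values the labels represent.
x <- factor(c("10", "7", "3"))
as.numeric(x)                # 1 3 2 -- the underlying level codes
as.numeric(as.character(x))  # 10 7 3 -- convert to character first to get the actual values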
The vector below will recycle itself to fill out the table; this only works if your table doesn't skip any values. Also, it is not advisable to have column names that are bare numbers; better to give them proper names like "test1", "test2", etc.:
dat[, metadata := c("name", "country", 1:8, "medal")]  # whatever you want to name your future 11 columns
dat[, participant := 1:(.N / 11) %>% rep(each = 11)]   # same idea, can't have missing rows
Now, reshape and convert from strings to numeric where possible:
new.dat = dcast(dat, participant ~ metadata, value.var = "stage3")[
  , lapply(.SD, function(x) type.convert(as.character(x), as.is = TRUE))]
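If you then want the per-problem columns to carry proper names such as "p1".."p8" rather than bare numbers (as advised above), one possible follow-up (a sketch assuming the new.dat built by the dcast above; "p1".."p8" are purely illustrative names) is:
setnames(new.dat, as.character(1:8), paste0("p", 1:8))  # rename columns "1".."8" to "p1".."p8"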

Related

R: How to Count Rows with Subsetted Date in Date Formatted Column

I have about 30,000 rows of data with a Date column in date format. I would like to be able to count the number of rows by month/year and by year, but when I aggregate with the code below, I get a vector within the data frame for my results instead of a count.
Using the hyperlinked csv file, I have tried the aggregate function.
https://www.dropbox.com/s/a26t1gvbqaznjy0/myfiles.csv?dl=0
short.date <- strftime(myfiles$Date, "%Y/%m")
aggr.stat <- aggregate(myfiles$Date ~ short.date, FUN = count)
Below is a view of the aggr.stat data frame. There are two columns and the second one beginning with "c(" is the one where I'd like to see a count value.
1 1969/01 c(-365, -358, -351, -347, -346)
2 1969/02 c(-323, -320)
3 1969/03 c(-306, -292, -290)
4 1969/04 c(-275, -272, -271, -269, -261, -255)
5 1969/05 c(-245, -240, -231)
6 1969/06 c(-214, -211, -210, -205, -204, -201, -200, -194, -190, -186)
I'm not much into downloading any unknown file from the internet, so you'll have to adapt my proposed solution to your needs.
You can solve the problem with the help of data.table and lubridate.
Imagine your data has at least one column of actual dates called dates (that is, class(df$dates) returns Date or something similar, such as POSIXct).
# load libraries
library(data.table)
library(lubridate)
# convert df to a data.table
setDT(df)
# count rows per month
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
.N counts the number of rows, by = groups the data. See ?data.table for further details.
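As a minimal self-contained sketch of the idea (with a few made-up dates rather than the linked file):
library(data.table)
library(lubridate)
df <- data.frame(dates = as.Date(c("1969-01-01", "1969-01-15", "1969-02-03")))
setDT(df)
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
#     monthDate N
# 1: 1969-01-01 2
# 2: 1969-02-01 1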
Consider running everything from data frames. Specifically, add the needed month/year column to the data frame and then run aggregate with the data argument (instead of passing separate vectors). Finally, there is no count() function in base R; use length instead:
# NEW COLUMN
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
# AGGREGATE WITH SPECIFIED DATA
aggr.stat <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
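As a quick sanity check on toy data (made-up dates, not the linked file), this is the kind of output you should get:
myfiles <- data.frame(Date = as.Date(c("1969-01-05", "1969-01-20", "1969-02-11")))
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
aggregate(Date ~ short.date, data = myfiles, FUN = length)
#   short.date Date
# 1    1969/01    2
# 2    1969/02    1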

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio and I need to create a vector that stores the first 100 rows of the csv file the programme reads. However, despite all my attempts, my variable v1 ends up becoming a dataframe instead of an int vector. May I know what I can do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:
cup_data_small is a tibble, a slightly modified version of a dataframe with somewhat different rules designed to avoid some common quirks/inconsistencies of standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector while df[, c("a", "b")] gives you a dataframe - you're using the same syntax, so arguably they should give the same type of result.
To get just a vector from a tibble, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]

Group by with data.table using sum

I have a dataframe that I want to group by users and find sum of quantity.
library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')
dt = data.table(x)
colnames(dt)
"dates_d" "user" "proj" "quantity"
The column quantity looks like this:
quantity
1
34
12
13
3
12
-
11
1
I heard that the data.table library is very fast, so I would like to use it.
I have done this in Python but don't know how to do it in R.
Due to historical memory-limitation issues, R reads data in as factors by default. When there is a character-like entry (such as "-") in a column, the whole column is read in as non-numeric. Now that RAM is more easily available, you can simply read the data in as strings first, so that the column stays a character vector rather than becoming a factor.
Then use as.numeric to convert the values into real numbers before summing. Strings that cannot be converted into numbers become NA instead, and na.rm = TRUE ignores those NAs in the sum.
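For example (a toy illustration of the conversion, not the OP's file):
qty <- c("1", "34", "-", "11")
as.numeric(qty)                     # 1 34 NA 11, with a warning about NAs introduced by coercion
sum(as.numeric(qty), na.rm = TRUE)  # 46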
Taking all of the above:
library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)
setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
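If you do try fread as suggested in the comment above, a sketch along the same lines (reusing the question's file path; fread returns a data.table directly and keeps strings as character) would be:
x <- fread('C:/Users/user/Desktop/20180911_Dataset_b.csv', encoding = 'UTF-8')
x[, sum(as.numeric(quantity), na.rm = TRUE), by = .(user)]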
Reference:
a useful comment from phiver in Is there any good reason for columns to be characters instead of factors?
linking to a blog by Roger Peng:
https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
Alternatively, with dplyr, replace the "-" placeholder with NA first and then sum:
library(dplyr)
dt[dt == "-"] <- NA
df <- dt %>%
  group_by(user) %>%
  summarise(qty = sum(as.numeric(as.character(quantity)), na.rm = TRUE))

Get column names and frequency from a table that looks like a matrix

I have a data frame that looks like:
'Part Number' 'Person Working'
'A' 'James'
'B' 'Brian'
'A' 'Andrea'
'C' 'Tiffany'
and so on for thousands of rows. The same part can have multiple people assigned to it. I'm pretty bad at summarizing data in R, but I'm able to produce (in the console) a table that looks like a frequency matrix by typing:
table(df$partnumber, df$personworking)
and it spits out unique items as rows, and every person working's name as a column. The values are a 0 or a 1 depending on if they are working that part.
What I'm looking for is a way to summarize this information in a digestible format that says, per item:
Part Number NumWorkers Names
A 3 "James, Andrea"
B 1 "Brian"
C 1 "Tiffany"
I'm also struggling with getting my table into a data frame. I've tried:
thedataframe <- data.frame(thetable[,])
but I'm not getting very far. I want to sum the number of people working on each unique part, and concatenate and print each column name that has a 1 as its value for a given part.
What is the best way to summarize this data in Base R?
Here is a method you could use in base R with aggregate:
dfAgg <- do.call(data.frame,
                 aggregate(df$Person, list(df$Parts),
                           FUN = function(x) c(length(x), paste(x, collapse = ", "))))
# add nicer names
names(dfAgg) <- c("Parts", "Count", "Person")
Aggregate allows you to run a function over groups. In this instance, we are running a function that returns both the count of individuals (via length) and their names (via paste).
Here is the sample data I used to test this.
data
set.seed(1234)
df <- data.frame("Parts"=sample(LETTERS[1:3], 10, replace=T),
"Person"=sample(c("James", "Brian", "Sam", "Tiff", "Sandy"),
10, replace=T), stringsAsFactors=F)
We can use data.table: convert the 'data.frame' to a 'data.table' (setDT(df)), then, grouped by 'partnumber', get the number of rows (.N) and paste together the 'personworking' values within each 'partnumber'.
library(data.table)
setDT(df)[,.(NumWorkers = .N, Names = toString(personworking)) , by = partnumber]
or we could use dplyr
library(dplyr)
df %>%
  group_by(partnumber) %>%
  summarise(NumWorkers = n(), Names = toString(personworking))
Or using base R
do.call(rbind, by(df, df$partnumber, FUN = function(x)
data.frame(NumWorkers = length(x$personworking), Names = toString(x$personworking))))
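Applied to a small data frame mimicking the question's example (using the question's column names partnumber and personworking), the data.table call above would give something like:
df <- data.frame(partnumber = c("A", "B", "A", "C"),
                 personworking = c("James", "Brian", "Andrea", "Tiffany"),
                 stringsAsFactors = FALSE)
library(data.table)
setDT(df)[, .(NumWorkers = .N, Names = toString(personworking)), by = partnumber]
#    partnumber NumWorkers          Names
# 1:          A          2  James, Andrea
# 2:          B          1          Brian
# 3:          C          1        Tiffany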

How to write loop for creating multiple variables in R

I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <- data$sleep64 + data$wakeup64, and data$total94 <- data$sleep94 + data$wakeup94.
Without a loop, I would need to write this line 94 times. I hope someone can give me some tips on this. It doesn't have to be a loop; any easier way to do this would be welcome.
FYI, every variable is numeric and has about 30% missing values. The missing values are random and could be anywhere; a missing value is a blank, not a 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
              paste0(rep(c("sleep", "wakeup"), each = 94), 1:94))[, id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
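If you later need the data back in the original wide layout, a possible sketch (using data.table's multi-column dcast on the long_data built above) is:
wide_again <- dcast(long_data, id ~ day, value.var = c("sleep", "wakeup", "total"))
# columns come back as sleep_1..sleep_94, wakeup_1..wakeup_94, total_1..total_94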
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add it together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
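An equivalent alternative to the replace() step (a sketch; like replace(), it treats a missing value as 0 when adding):
data[nm1] <- Map(function(s, w) rowSums(cbind(s, w), na.rm = TRUE), sleepDat, wakeDat)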
If the missing value is '', then the columns would be either factor or character class (it is not clear from the OP's post which). In that case, we may need to convert the columns to numeric class so that the '' values are automatically converted to NA:
sleepDat[] <- lapply(sleepDat, function(x) as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x) as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
