I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <-data$sleep64 + data$wakeup64,data$total94<-data$sleep94+data$wakeup94.
Without a loop, I need to write this code 94 times. I hope someone could give me some tips on this. It doesn't have to be a loop, but an easier way to do this.
FYI, all of the variables are numeric and about 30% of their values are missing. The missing values occur at random and can be anywhere. A missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
# each = 2 gives sleep1, wakeup1, sleep2, wakeup2, ... (188 unique names)
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
paste0(c("sleep", "wakeup"), rep(1:94, each = 2)))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add the two datasets together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (not clear from the OP's post). In that case, we may need to convert the dataset to numeric class so that the '' will be automatically converted to NA
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
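Since the question asked about a loop, here is a minimal base R sketch of that route, shown on made-up data with 2 days instead of 94. The column names total_na0_1 and total_na0_2 are hypothetical and only illustrate the "treat NA as 0" variant; the key point is that [[ accepts a column name built with paste0.

```r
# Toy data: 2 "days" instead of 94; the values are made up for illustration.
data <- data.frame(
  sleep1  = c(7, NA, 8),
  wakeup1 = c(1, 2, NA),
  sleep2  = c(6, 5, NA),
  wakeup2 = c(NA, 1, NA)
)

n_days <- 2  # would be 94 for the data set in the question

for (i in seq_len(n_days)) {
  s <- data[[paste0("sleep", i)]]
  w <- data[[paste0("wakeup", i)]]

  # Plain addition: the total is NA whenever either part is missing.
  data[[paste0("total", i)]] <- s + w

  # Alternative (hypothetical column name): treat a missing part as 0.
  data[[paste0("total_na0_", i)]] <- rowSums(cbind(s, w), na.rm = TRUE)
}

data
```

Which of the two totals is appropriate depends on whether a missing sleep or wakeup value should make the whole day missing.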
I have been using Stata, where loops are easy to write. In R, however, I have run into errors when looping over variables. I tried some of the code posted here and it does not work. Basically, I am trying to clean the data by taking logs of the values. I had to convert negative values to positive before logging them.
I intend to loop over multiple firm statistics on the dataframe but I faced errors in doing so.
varlist <- c("revenue", "profit", "cost")
for (v in varlist) {
data$log_v <- log(abs(ifelse(data$v>1, data$v, NA)))
data$log_v <- ifelse(data$v<0, data$log_v*-1,data$log_v)
}
Error in `$<-.data.frame`(`*tmp*`, "log_v", value = numeric(0)) : replacement has 0 rows, data has 9
It looks like you might be assuming that data$log_v gets read as data$log_profit, but R takes it literally and reads it as "log_v" all 3 times. This example might not be quite everything you're trying to do, but it might help you. It takes a vector of variable names and references the columns via their string names.
df <- data.frame(x = rnorm(15), y = rnorm(15))
vars <- c("x", "y")
for (v in vars) {
df[paste0("log_", v)] <- log(abs(df[v]))
}
Here's roughly the same thing in data.table.
library(data.table)
dt <- data.table(x = rnorm(15), y = rnorm(15))
dt[, `:=`(log_x = log(abs(x)), log_y = log(abs(y)))]
Here is an explanation to the source of your confusion:
A data.frame is a special type of list; its elements are vectors of the same length (the columns). Normally, you access an element of a list using the [[ function, for example df[["revenue"]]. Instead of "revenue", you can also use a variable, such as df[[varlist[1]]]. So far, so good.
However, lists have a convenience operator, $, which allows you to access the elements with less typing: df$revenue. Unfortunately, you cannot use variables this way; this is by design. Since you don't have to use quotes with $, the operator cannot know whether you mean revenue as the literal name of the element or revenue as the variable that holds the literal name of the element.
Therefore, if you want to use variables, you need to use the [[ function, and not the $. Since programmers hate typing and want to make code as terse as possible, various ways around it have been invented, such as data.tables and tidyverse (I am exaggerating a bit here).
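To make the difference concrete, here is a small base R sketch on made-up numbers (the data values are assumptions, the column names are the ones from the question). It shows that data$v looks for a column literally called "v", while data[[v]] uses the string stored in v.

```r
# Made-up data with the column names from the question.
data <- data.frame(revenue = c(10, 100, 1000),
                   profit  = c(-10, 1, 100),
                   cost    = c(5, 50, 500))
varlist <- c("revenue", "profit", "cost")

v <- "revenue"
is.null(data$v)                     # `$` looks for a column literally named "v"
identical(data[[v]], data$revenue)  # `[[` uses the value stored in v

for (v in varlist) {
  # paste0() builds the new column name ("log_revenue", ...) as a string.
  data[[paste0("log_", v)]] <- log(abs(data[[v]]))
}

names(data)
```

Because data$v is NULL, every computation built on it in the original loop has length 0, which is where the "replacement has 0 rows" error comes from.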
Also, here is a tidyverse solution.
library(tidyverse)
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue=rnorm(100), profit=rnorm(100), cost=rnorm(100))
df <- df %>% mutate_at(varlist, list(log10 = ~ log10(abs(.))))
Explanation:
mutate_at applies log10(abs(.)) to every column named in varlist. The dot . is a temporary variable that holds the values of the column currently being processed.
By default, mutate_at replaces the existing variables. However, if instead of providing a function (~ log10(abs(.))) you provide a named list (list(log10 = ~ log10(abs(.)))), it adds new columns, using log10 as a suffix in the column names.
This method makes it easy to apply several functions to your columns, not just one.
See? No (obvious) loops at all!
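For comparison, roughly the same result can be had in base R without an explicit loop. This is only a sketch on random toy data; the "_log10" suffix is chosen to mimic the column names the tidyverse call produces.

```r
varlist <- c("revenue", "profit", "cost")

set.seed(1)  # arbitrary seed, only to make the toy data reproducible
df <- data.frame(revenue = rnorm(5), profit = rnorm(5), cost = rnorm(5))

# One new column per entry of varlist, named e.g. "revenue_log10".
new_names <- paste0(varlist, "_log10")
df[new_names] <- lapply(df[varlist], function(x) log10(abs(x)))

names(df)
```

lapply still iterates over the columns internally, so this is "no loop" only in the same sense as the tidyverse version.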
After reading some XML files, I need to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how to split the single column (see the code and results) into several columns according to given criteria.
In my opinion, we need either a loop with a step or a special function, but I do not know which function exactly :/
stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames=TRUE, check.names=TRUE, key=NULL,
stringsAsFactors=TRUE)
for (i in seq(from = 1, to = 1375, by = 11)){
if (is.numeric(DT[i,stage3] = FALSE)){
DT$Name <- DT[i,stage3]
}
}
https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg
This is an example of the first 20 rows out of 1375.
Here is how the data.table looks now. What I need is to separate these results into the columns "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), the scores for problems 1-8 (8 columns, respectively), and the medal. The loop I have written is, I think, something that should extract the value from the existing column with a step of 11 (since every name, country, etc. repeats every 11 rows) and transfer it into a new one. Unfortunately, it doesn't work :/
Thanks in advance for your help!
Give this a shot.
First, load the required packages:
library (data.table)
library (stringr) # this is just for the piping operator %>%
You would read in your own data table here, I am creating one as an example:
dat = c( "Sergey","USSR",1,2,3,4,5,6,7,8,"silver") %>% rep (125) %>% data.table
setnames (dat, "stage3")
As a quick note, I would not be reading in your strings as factors as you do in your own code, because then it can screw up the conversion to numeric.
The labels below repeat to fill out the table. This only works if your table doesn't skip values. Also, it is not advisable to have numbers as column names; better to give them proper names like "test1", "test2", etc.:
# whatever you want to name your future 11 columns; rep() makes the recycling explicit
dat[, metadata := rep(c("name", "country", 1:8, "medal"), length.out = .N)]
# same idea, can't have missing rows: one participant number per block of 11 rows
dat[, participant := rep(seq_len(.N / 11), each = 11)]
Now, reshape and convert from strings to numeric where possible:
new.dat =
dcast (dat, participant ~ metadata, value.var = "stage3") [, lapply (.SD, type.convert, as.is = TRUE) ]
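If the scraped vector really has exactly 11 entries per participant with nothing missing, a base R alternative is to fill a matrix row by row. This is a sketch under that assumption; the toy values and the column names p1 to p8 are hypothetical.

```r
# Flat vector as scraped: 11 consecutive entries per participant (toy values).
stage3 <- rep(c("Sergey", "USSR", 1:8, "silver"), times = 3)

# Hypothetical column names; p1..p8 stand for the eight problem scores.
col_names <- c("name", "country", paste0("p", 1:8), "medal")

# byrow = TRUE turns each block of 11 entries into one row.
wide <- as.data.frame(
  matrix(stage3, ncol = 11, byrow = TRUE, dimnames = list(NULL, col_names)),
  stringsAsFactors = FALSE
)

# Convert columns that look numeric; as.is = TRUE keeps text as character.
wide[] <- lapply(wide, type.convert, as.is = TRUE)

str(wide)
```

If any cell is missing in the source table, the blocks shift and every later row is wrong, so it is worth checking that length(stage3) is a multiple of 11 first.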
I work mostly with data table but a data frame solution would work as well.
I have the result of an apply which returns this data structure
applyres=structure(c(0.0260, 3.6775, 0.92), .Names = c("a.1", "a.2", "a.3"))
Then I have a data table
coltoadd=c('a.1','a.2','a.3')
dt <- data.table(variable1 = factor(c("what","when","where")))
dt[,coltoadd]=as.numeric(NA)
Now I would like to add the elements of applyres to the corresponding columns, just one row at a time, because applyres is calculated from another function. I have tried different assignments but nothing seems to work. Ideally I would like to assign based on column name, just in case the columns change order in one of the two structures.
This doesn't work
dt[1,coltoadd]=applyres
I also tried
dt[1,coltoadd := applyres]
I also tried changing applyres to a matrix or a data table and transposing it.
I would like to do something like this
dt[1,coltoadd[i]]=applyres[coltoadd[i]]
But not sure if it should go in a loop, doesn't seem the best way to do it.
How do I avoid doing single assignments if I have a large number of columns?
1) data.frame Convert to data.frame, perform the assignments and convert back.
DF <- as.data.frame(dt)
DF[1, -1] <- applyres
# perform the remaining assignments
dt <- as.data.table(DF)
2) loop Another possibility is a for loop:
for(i in 2:ncol(dt)) dt[1, i] <- applyres[i-1]
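Since the question asks for assignment by column name, here is a sketch that stays in data.table and matches on names rather than positions. It assumes the data.table package is installed; the different ordering of coltoadd is deliberate, to show that the names decide where each value goes.

```r
library(data.table)

applyres <- c(a.1 = 0.0260, a.2 = 3.6775, a.3 = 0.92)
coltoadd <- c("a.3", "a.1", "a.2")  # deliberately in a different order

dt <- data.table(variable1 = factor(c("what", "when", "where")))
dt[, (coltoadd) := NA_real_]  # create the empty numeric columns

# Index applyres by name so each value lands in the column of the same name;
# as.list() supplies one list element per column on the left-hand side.
dt[1, (coltoadd) := as.list(applyres[coltoadd])]

dt
```

The parentheses around coltoadd tell := to use the character vector's contents as column names instead of creating a column called "coltoadd".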
I have a dataframe that I want to group by users and find sum of quantity.
library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')
dt = data.table(x)
colnames(dt)
"dates_d" "user" "proj" "quantity"
the column quantity is like this:
quantity
1
34
12
13
3
12
-
11
1
I heard that data.table library is very fast so I would like to use that.
I have done this in Python but don't know how to do it in R.
For historical reasons related to memory limits, R versions before 4.0.0 read strings in as factors by default. When a column contains a character-like entry, the whole column is read as strings (and, under that old default, converted to a factor). With RAM now more easily available, you can read the data in as strings (stringsAsFactors = FALSE, which is the default since R 4.0.0) so that the column stays a character vector rather than a factor.
Then use as.numeric to convert into a real valued number before summing. Strings that cannot be converted into numbers are converted into NA instead. na.rm=TRUE ignores NAs in the sum.
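A tiny base R illustration of those two steps, on made-up data that mimics the "-" entry in the quantity column (the user labels are assumptions):

```r
# Made-up data; quantity arrives as text because of the "-" entry.
x <- data.frame(user     = c("a", "a", "b", "b", "b"),
                quantity = c("1", "34", "-", "11", "1"),
                stringsAsFactors = FALSE)

# "-" cannot be parsed as a number, so it becomes NA (with a warning,
# silenced here on purpose).
qty <- suppressWarnings(as.numeric(x$quantity))

# Sum per user, ignoring the NA.
totals <- tapply(qty, x$user, sum, na.rm = TRUE)
totals
```

If quantity had been read as a factor instead, as.numeric would return the internal level codes, so convert with as.character first in that case.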
Taking all of the above:
library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)
setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
Reference:
a useful comment from phiver in Is there any good reason for columns to be characters instead of factors?
linking to a blog by Roger Peng:
https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
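library(dplyr)
# turn "-" into NA, convert to numeric, then sum per user
# (sum(!is.na(quantity)) would only count the non-missing rows)
df <- dt %>%
mutate(quantity = as.numeric(na_if(as.character(quantity), "-"))) %>%
group_by(user) %>%
summarise(qty = sum(quantity, na.rm = TRUE))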
After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.
I am having trouble with a particular task:
For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.
I can do this for the entire data.table using the set command, but I need to do this operation by station.
Example:
# Example data. Assume the rows are ordered by date.
set.seed(1)
DT <- data.table(station=sample.int(n=3, size=1e6, replace=TRUE),
wind=rgamma(n=1e6, shape=1.5, rate=1/10),
other=rnorm(n=1e6),
key="station")
# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate),
j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)
The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.
I tried to use the by=station operator within the [ ], but I am not sure how to do this.
One way is to get the row numbers you need to replace using the special variable .I and then assign NA to those rows by reference using the := operator (or set).
# get the row numbers (abs() because the question asks for the absolute difference)
idx = DT[, .I[which(c(NA, abs(diff(wind))) > 35)], by=station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]
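For readers who want to check the grouping logic on something small, here is a base R sketch of the same idea using ave() on a made-up six-row data frame. It is not the data.table approach above, just an illustration that the lagged difference restarts at each station.

```r
# Toy data, already ordered by date within each station (values are made up).
df <- data.frame(station = c(1, 1, 1, 2, 2, 2),
                 wind    = c(10, 50, 12, 100, 105, 20))
max_rate <- 35

# Absolute lag-1 difference computed separately per station; the first
# reading of each station gets NA, so no difference crosses a station boundary.
lag_diff <- ave(df$wind, df$station,
                FUN = function(w) c(NA, abs(diff(w))))

# Replace readings whose jump from the previous reading exceeds the threshold.
df$wind[which(lag_diff > max_rate)] <- NA
df
```

Note that the first reading of station 2 (100) is kept even though it differs from the last reading of station 1 by 88, which is exactly the behaviour the question asks for.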
Feature request #2793, filed by @eddi, would, if implemented, give a much more natural way to accomplish this task: the expression producing the indices goes on the LHS and the replacement value on the RHS. That is, in the future, we should be able to do:
# in the future - a more natural way of doing the same operation shown above.
DT[, wind[which(c(NA, abs(diff(wind))) > 35)] := NA_real_, by=station]