Fastest way to fread/process "JSON-like" column in data.table?

Here's some sample data I'm working with: DT_IN shows the input format and DT_OUT the form I'd like to end up with. What's the best way to go from DT_IN to DT_OUT?
I've tried strsplit, but couldn't keep the splits ordered so they could be rbind-ed back in the corresponding order. I'm open to any solution; maybe Rcpp could help?
library(data.table)
DT_IN <- data.table(
  user_id = c(1L, 20L, 4L, 6L, 9L),
  latitude = c(-41.3103218, -40.8307381, -37.3932037, -42.7178726, -45.0156822),
  longitude = c(174.824554, 172.793106, 175.840637, 170.965454, 168.731186),
  parameters = c(
    "{\"\"network\"\"=>\"\"Telecom NZ\"\", \"\"accuracy\"\"=>28.659999847412, \"\"internet\"\"=>\"\"4G\"\", \"\"location_age\"\"=>1}",
    "{\"\"location_age\"\"=>716}",
    "{\"\"location_age\"\"=>851}",
    "{\"\"accuracy\"\"=>14, \"\"location_age\"\"=>1}",
    "{\"\"network\"\"=>\"\"VodafoneNZ\"\", \"\"accuracy\"\"=>29, \"\"internet\"\"=>\"\"3G\"\", \"\"location_age\"\"=>31}"
  )
)
> DT_IN
user_id latitude longitude parameters
1: 1 -41.31032 174.8246 {""network""=>""Telecom NZ"", ""accuracy""=>28.659999847412, ""internet""=>""4G"", ""location_age""=>1}
2: 20 -40.83074 172.7931 {""location_age""=>716}
3: 4 -37.39320 175.8406 {""location_age""=>851}
4: 6 -42.71787 170.9655 {""accuracy""=>14, ""location_age""=>1}
5: 9 -45.01568 168.7312 {""network""=>""VodafoneNZ"", ""accuracy""=>29, ""internet""=>""3G"", ""location_age""=>31}
DT_OUT <- data.table(
  user_id = c(1L, 20L, 4L, 6L, 9L),
  latitude = c(-41.3103218, -40.8307381, -37.3932037, -42.7178726, -45.0156822),
  longitude = c(174.824554, 172.793106, 175.840637, 170.965454, 168.731186),
  network = c('Telecom NZ', NA, NA, NA, 'VodafoneNZ'),
  accuracy = c(28.659999847412, NA, NA, 14, 29),
  internet = c('4G', NA, NA, NA, '3G'),
  location_age = c(1, 716, 851, 1, 31)
)
> DT_OUT
user_id latitude longitude network accuracy internet location_age
1: 1 -41.31032 174.8246 Telecom NZ 28.66 4G 1
2: 20 -40.83074 172.7931 <NA> NA <NA> 716
3: 4 -37.39320 175.8406 <NA> NA <NA> 851
4: 6 -42.71787 170.9655 <NA> 14.00 <NA> 1
5: 9 -45.01568 168.7312 VodafoneNZ 29.00 3G 31

Using the jsonlite package ...
# Convert json like strings to json.
DT_IN[, parameters := gsub("\"\"", "\"", parameters)]
DT_IN[, parameters := gsub("=>", ":", parameters)]
# Stream_in the json and cbind it to existing data.
DT_IN <- cbind(DT_IN, jsonlite::stream_in(textConnection(DT_IN$parameters)))
# Remove `parameters`
DT_IN[, parameters := NULL]
DT_IN
# user_id latitude longitude network accuracy internet location_age
# 1: 1 -41.31032 174.8246 Telecom NZ 28.66 4G 1
# 2: 20 -40.83074 172.7931 <NA> NA <NA> 716
# 3: 4 -37.39320 175.8406 <NA> NA <NA> 851
# 4: 6 -42.71787 170.9655 <NA> 14.00 <NA> 1
# 5: 9 -45.01568 168.7312 VodafoneNZ 29.00 3G 31
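If stream_in ever becomes the bottleneck on a large table, one dependency-free alternative is to pull each key out with a vectorised regex instead of parsing full JSON. A minimal sketch, starting again from the original DT_IN and assuming the four keys above are the only ones that occur (extract_param is a hypothetical helper, not from any package):
# Extract the value following ""key""=> ; NA where the key is absent.
extract_param <- function(x, key) {
  pat <- sprintf('""%s""=>(""[^"]*""|[0-9.]+)', key)
  m   <- regexpr(pat, x)
  out <- rep(NA_character_, length(x))
  hit <- m != -1L
  val <- regmatches(x, m)                      # matches for the hits only
  val <- sub(sprintf('""%s""=>', key), "", val, fixed = TRUE)
  out[hit] <- gsub('"', '', val)
  out
}
for (k in c("network", "accuracy", "internet", "location_age"))
  DT_IN[, (k) := extract_param(parameters, k)]
DT_IN[, c("accuracy", "location_age") := lapply(.SD, as.numeric),
      .SDcols = c("accuracy", "location_age")]
On this sample the result matches DT_OUT; whether it actually beats stream_in is worth benchmarking before committing to it.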

Related

(R) How to copy paste values from one column based on another column and ID in R

For simplicity, let's assume I have two columns.
First: ID (string codes such as AA23, BA53, NA, etc.)
Second: Age (18, 32, 55, 23, etc.)
IDs sometimes repeat (i.e., one person, AA23, filled in the survey on several days, but was only asked his age on the first day, not on the second and third).
I want to copy values down the Age column based on the ID, so that I have a 'long format' of the data frame.
dput(data):
structure(list(Code = c("MW68", "AW80", "EW40", "BW60", "Wn36",
"ZK45", "SI55", "MW68", "EW40", "DC06", NA, "IW28"), Age = c("52",
"26", "34", "26", "20", "35", NA, NA, NA, NA, NA, NA)), row.names = c(5L,
6L, 7L, 8L, 9L, 10L, 400L, 401L, 402L, 403L, 404L, 405L), class = "data.frame")
Input:
ID Age
AA23 18
BA53 32
AC13 55
AA23 NA
BA53 NA
AC13 NA
NA 23
AA23 NA
(the trick is that sometimes ID is NA)
And the desired output:
ID Age
AA23 18
BA53 32
AC13 55
AA23 18
BA53 32
AC13 55
NA 23
AA23 18
Thank you in advance!
You can also use the function coalesce(), which replaces each NA with the first non-NA value among its arguments; here we fill the NAs in Age with the first Age value of each Code group (the grouping variable):
library(dplyr)
df %>%
  group_by(Code) %>%
  mutate(across(Age, ~ coalesce(.x, first(.x))))
# A tibble: 12 x 2
# Groups: Code [10]
Code Age
<chr> <chr>
1 MW68 52
2 AW80 26
3 EW40 34
4 BW60 26
5 Wn36 20
6 ZK45 35
7 SI55 NA
8 MW68 52
9 EW40 34
10 DC06 NA
11 NA NA
12 IW28 NA
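A related idiom, not shown above, is tidyr::fill(), which carries observations within each group; a short sketch on the same data:
library(dplyr)
library(tidyr)
# Fill each NA Age from any non-NA Age in the same Code group,
# looking down the group first and then up.
df %>%
  group_by(Code) %>%
  fill(Age, .direction = "downup") %>%
  ungroup()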
I'm not quite sure if I understood correctly what you want to do, but the code below looks for rows where Age is NA and fills in the mean Age from the other rows with the same Code. Obviously, this will fail for any Code that has no Age value anywhere in the table. If different rows with the same Code carry different Age values, it fills in their mean, since you didn't specify what to do in that case.
# Age comes out of the dput above as character, so convert it first.
data$Age <- as.numeric(data$Age)
for (i in 1:nrow(data)) {
  if (!is.na(data$Code[i]) && is.na(data$Age[i])) {
    data$Age[i] <- mean(data$Age[data$Code == data$Code[i]], na.rm = TRUE)
  }
}
This skips rows with NA in the Code column.
Here's a solution based on zoo's function na.locf() ("last observation carried forward"): first you group by `Code`, then you mutate column `Age` using `ifelse`, carrying the last non-`NA` observation forward:
library(zoo)
library(dplyr)  # for group_by(), mutate() and %>%
data %>%
  group_by(Code) %>%
  mutate(Age = ifelse(is.na(Age), na.locf(Age), Age))
# A tibble: 12 x 2
# Groups: Code [10]
Code Age
<chr> <chr>
1 MW68 52
2 AW80 26
3 EW40 34
4 BW60 26
5 Wn36 20
6 ZK45 35
7 SI55 NA
8 MW68 52 # <- value `carried forward`
9 EW40 34 # <- value `carried forward`
10 DC06 NA
11 NA NA
12 IW28 NA

Sorting data via if statement in R

I have a large CSV of workout data extracted from GPX files consisting of 6 columns:
1. No (e.g., 1 through ~900 thousand)
2. latitude (e.g., 34.105329)
3. longitude (e.g., -118.299236)
4. elevation (in meters)
5. date (e.g., 10/20/2017)
6. time (e.g., 2:08:05 AM)
I would like to establish a column that notes the workout number, e.g., workout 1 encompasses rows 1 through 2000 and workout 2 encompasses rows 2001 through 5000. I was able to accomplish this in Excel with an IF statement, but have not figured out how to do it in R.
Basically, if a data point was recorded on the same day AND within two hours of the preceding data point, both points belong to the same workout. If data points were logged on the same day but are separated by more than two hours, they belong to two separate workouts. I've pasted some data below that includes the first few rows of Workout 1 and the first few rows of Workout 2 (just enough to demonstrate how the Excel formula works).
Dput Code:
dput(droplevels(mydata[1:10, ]))
Dput Output:
structure(list(No = 1:10, Latitude = c(34.092483, 34.092534,
34.092573, 34.092624, 34.092652, 34.092684, 34.092712, 34.092742,
34.092774, 34.092808), Longitude = c(-118.300414, -118.300448,
-118.300434, -118.300431, -118.300428, -118.300425, -118.300423,
-118.300425, -118.300426, -118.300427), Altitude = c(104.2, 104.2,
104.3, 104.4, 104.4, 104.5, 104.5, 104.5, 104.6, 104.6), Date = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "10/20/2017", class = "factor"),
Time = structure(1:10, .Label = c("1:40:18", "1:43:06", "1:43:08",
"1:43:10", "1:43:11", "1:43:12", "1:43:13", "1:43:14", "1:43:15",
"1:43:16"), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
Data Sample:
No Latitude Longitude Altitude Date Time Workout#
1 34.092483 -118.300414 104.2 10/20/2017 1:40:18 1
2 34.092534 -118.300448 104.2 10/20/2017 1:43:06 1
3 34.092573 -118.300434 104.3 10/20/2017 1:43:08 1
4 34.092624 -118.300431 104.4 10/20/2017 1:43:10 1
5 34.092652 -118.300428 104.4 10/20/2017 1:43:11 1
1332 34.092487 -118.300577 104.1 11/4/2017 1:23:24 2
1333 34.092513 -118.300565 104.2 11/4/2017 1:23:25 2
1334 34.09255 -118.30053 104.3 11/4/2017 1:23:26 2
1335 34.092592 -118.300495 104.4 11/4/2017 1:23:28 2
1336 34.092619 -118.300481 104.4 11/4/2017 1:23:29 2
1337 34.092668 -118.300467 104.5 11/4/2017 1:23:31 2
Edit:
Thank you to @AllanCameron and @GregorThomas. I ran your code and summed it up with the code below, which yields the desired results.
library(sqldf)
# 7200 seconds = the 2-hour cutoff
workout_id <- cumsum(c(1, as.numeric(diff(workout_times) > 7200)))
# Add 'workout_id' to the 'mydata' data frame
mydata$workout_id <- workout_id
sqldf("select distinct(workout_id) from mydata")
Assuming that your workouts are more than 30 minutes apart, you can do this:
workout_times <- as.POSIXct(paste(df$Date, df$Time), format = "%m/%d/%Y %H:%M:%S")
cumsum(c(1, as.numeric(diff(workout_times) > 1800)))
#> [1] 1 1 1 1 1 2 2 2 2 2 2
You can change the 1800 to a number of seconds between workouts that seems best for you.
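Note that the asker's rule also requires both points to fall on the same calendar day, which a pure time-gap test doesn't quite capture (e.g., 11:30 PM followed by 12:30 AM). A minimal sketch, assuming `df` holds the sample above, that starts a new workout on either a day change or a gap of more than two hours, with the gap computed in explicit seconds so difftime's automatic units can't surprise you:
workout_times <- as.POSIXct(paste(df$Date, df$Time), format = "%m/%d/%Y %H:%M:%S")
gap_secs   <- as.numeric(diff(workout_times), units = "secs")
day_change <- diff(as.Date(df$Date, format = "%m/%d/%Y")) != 0
# The first row always opens workout 1; each day change or >2 h gap increments it.
df$Workout <- cumsum(c(TRUE, gap_secs > 7200 | day_change))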

r Replace only some table values with values from alternate table

This is not a "vlookup-and-fill-down" question.
My source data is excellent at delivering all the data I need, just not in a usable form. Recent changes in volume mean manually adjusted fixes are no longer feasible.
I have an inventory table and a services table. The inventory report does not contain purchase order data for services or non-inventory items. The services table (naturally) does. They are of course different shapes.
Pseudo-coding would be something to the effect of for every inventory$Item in services$Item, replace inventory$onPO with services$onPO.
Sample Data
inv <- structure(list(Item = c("10100200", "10100201", "10100202", "10100203",
"10100204", "10100205-A", "10100206", "10100207", "10100208",
"10100209", "10100210"), onHand = c(600L, NA, 39L, 0L, NA, NA,
40L, 0L, 0L, 0L, 0L), demand = c(3300L, NA, 40L, 40L, NA, NA,
70L, 126L, 10L, 10L, 250L), onPO = c(2700L, NA, 1L, 40L, NA,
NA, 30L, 126L, 10L, 10L, 250L)), .Names = c("Item", "onHand",
"demand", "onPO"), row.names = c(NA, -11L), class = c("data.table",
"data.frame"))
svc <- structure(list(Item = c("10100201", "10100204", "10100205-A"),
`Rcv'd` = c(0L, 0L, 44L), Backordered = c(20L, 100L, 18L)), .Names = c("Item",
"Rcv'd", "Backordered"), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
Assuming you want to replace NAs in onPO with values from Backordered here is a solution using dplyr::left_join:
library(dplyr);
left_join(inv, svc) %>%
mutate(onPO = ifelse(is.na(onPO), Backordered, onPO)) %>%
select(-Backordered, -`Rcv'd`);
# Item onHand demand onPO
#1 10100200 600 3300 2700
#2 10100201 NA NA 20
#3 10100202 39 40 1
#4 10100203 0 40 40
#5 10100204 NA NA 100
#6 10100205-A NA NA 18
#7 10100206 40 70 30
#8 10100207 0 126 126
#9 10100208 0 10 10
#10 10100209 0 10 10
#11 10100210 0 250 250
Or a solution in base R using merge:
inv$onPO <- with(merge(inv, svc, all.x = TRUE), ifelse(is.na(onPO), Backordered, onPO))
Or using coalesce instead of ifelse (thanks to @thelatemail):
library(dplyr);
left_join(inv, svc) %>%
mutate(onPO = coalesce(onPO, Backordered)) %>%
select(-Backordered, -`Rcv'd`);
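For completeness, the asker's pseudo-code also translates almost literally into base R's match(), with no intermediate merged table; a sketch, again assuming Backordered is what belongs in onPO:
# Align svc rows to inv by Item; idx is NA where inv$Item has no service row.
idx <- match(inv$Item, svc$Item)
hit <- !is.na(idx)
# Overwrite only the matched rows.
inv$onPO[hit]   <- svc$Backordered[idx[hit]]
inv$onHand[hit] <- svc$`Rcv'd`[idx[hit]]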
In data.table world, this is an "update-join". Join on "Item" and then update the values in the original set with the values from the new set:
library(data.table)
setDT(inv)
setDT(svc)
inv[svc, on="Item", c("onPO","onHand") := .(i.Backordered, `i.Rcv'd`)]
# inv - original table
# svc - update table
# on= - match on the specified variable
# :=  - overwrite onPO with Backordered,
#       and onHand with Rcv'd
# Item onHand demand onPO
# 1: 10100200 600 3300 2700
# 2: 10100201 0 NA 20
# 3: 10100202 39 40 1
# 4: 10100203 0 40 40
# 5: 10100204 0 NA 100
# 6: 10100205-A 44 NA 18
# 7: 10100206 40 70 30
# 8: 10100207 0 126 126
# 9: 10100208 0 10 10
#10: 10100209 0 10 10
#11: 10100210 0 250 250
Starting with the tables:
>inv
Item OnHand Demand OnPO
1: 10100200 600 3300 2700
2: 10100201 NA NA NA
3: 10100202 39 40 1
4: 10100203 0 40 40
5: 10100204 NA NA NA
6: 10100205-A NA NA NA
7: 10100206 40 70 30
8: 10100207 0 126 126
9: 10100208 0 10 10
10: 10100209 0 10 10
11: 10100210 0 250 250
> svc
Item Rcv'd Backordered
1: 10100201 0 20
2: 10100204 0 100
3: 10100205-A 44 18
After far more cursing than I'd like to admit, the simple solution that works on the above test data, and my live data proved to be:
# Insert OnHand and OnPO data from svc
for (i in 1:nrow(inv)) {
  if (inv$Item[i] %in% svc$Item) {
    x <- which(svc$Item == inv$Item[i])
    inv$OnPO[i]   <- svc$Backordered[x]
    inv$OnHand[i] <- svc$`Rcv'd`[x]
  }
}
# cleanup
inv[is.na(inv)] <- 0
Is there a simpler or more obvious method that I've overlooked?
We could use eat from my package safejoin, and "patch"
the matches from the rhs into the lhs when columns conflict.
We rename Backordered to onPO on the way so the two columns conflict as desired.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
eat(inv, svc, onPO = Backordered, .conflict = "patch")
# Item onHand demand onPO
# 1 10100200 600 3300 2700
# 2 10100201 NA NA 20
# 3 10100202 39 40 1
# 4 10100203 0 40 40
# 5 10100204 NA NA 100
# 6 10100205-A NA NA 18
# 7 10100206 40 70 30
# 8 10100207 0 126 126
# 9 10100208 0 10 10
# 10 10100209 0 10 10
# 11 10100210 0 250 250
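Since this was written, dplyr (>= 1.0) has gained rows_patch(), which fills NA values in x from matching rows of y; a sketch along the same lines (y's columns must be named like x's, hence the transmute):
library(dplyr)
inv %>%
  rows_patch(
    svc %>% transmute(Item, onPO = Backordered, onHand = `Rcv'd`),
    by = "Item"
  )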

R: Split Variable Column into multiple (unbalanced) columns by comma

I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of a few different "categories" that I want to split so that each column holds one category (similar to what split would do in Stata). For example:
# Name Age Number Events First
# Karen 24 8 Triathlon/IM,Marathon,10k,5k 0
# Kurt 39 2 Half-Marathon,10k 0
# Leah 18 0 1
And I want it to look like:
# Name Age Number Events_1 Event_2 Events_3 Events_4 First
# Karen 24 8 Triathlon/IM Marathon 10k 5k 0
# Kurt 39 2 Half-Marathon 10k NA NA 0
# Leah 18 0 NA NA NA NA 1
I have looked through Stack Overflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.
Note: may not be important, but the largest number of categories one person has is 19, so I would need to create Event_1:Event_19.
Comment: previous Stack Overflow threads suggested the separate function, but it did not seem to work with my dataset: the program ran, but when it finished nothing had changed, and there was no output and no error. Other suggestions from other threads gave error messages. I finally got it to work using the cSplit function. Thanks for the help!
From Ananda's splitstackshape package:
cSplit(df, "Events", sep=",")
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Or with tidyr:
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
# Name Age Number Events_1 Events_2 Events_3 Events_4 First
#1 Karen 24 8 Triathlon/IM Marathon 10k 5k 0
#2 Kurt 39 2 Half-Marathon 10k <NA> <NA> 0
#3 Leah 18 0 NA <NA> <NA> <NA> 1
With the data.table package:
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=F]
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
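If, as in the asker's note, the number of categories isn't known up front (up to 19 here), the number of columns can be derived from the data instead of hard-coding 4; a sketch on the sample df:
library(data.table)
setDT(df)
# The widest row determines how many Events_* columns are needed.
n_events <- max(lengths(strsplit(as.character(df$Events), ",")))
df[, paste0("Events_", seq_len(n_events)) := tstrsplit(Events, ",")]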
Data
df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt",
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))

Subtraction on different rows and columns and separated by group

I really hate to ask two questions in a row but this is something that I can’t wrap my head around. So let’s say I have a data frame, as follows:
df
Row# User Morning Evening Measure_Date
1 1 NA NA 2/18/11
2 1 50 115 2/19/11
3 1 85 128 2/20/11
4 1 62 NA 2/25/11
5 1 48 100.8 3/8/11
6 1 19 71 3/9/11
7 1 25 98 3/10/11
8 1 NA 105 3/11/11
9 2 48 105 2/18/11
10 2 28 203 2/19/11
11 2 35 80.99 2/21/11
12 2 91 78.25 2/22/11
Is it possible in R, for each user group, to subtract the morning value of a row from the evening value of the row for the previous consecutive day (and only the previous day, not simply the previous row)? My desired results would be this:
df
Row# User Morning Evening Measure_Date Difference
1 1 NA NA 2/18/11 NA
2 1 50 115 2/19/11 NA
3 1 85 129 2/20/11 30
4 1 62 NA 2/25/11 NA
5 1 48 100.8 3/8/11 NA
6 1 19 71 3/9/11 81.8
7 1 25 98 3/10/11 46
8 1 10 105 3/11/11 88
9 2 48 105 2/18/11 NA
10 2 28 203 2/19/11 77
11 2 35 80.99 2/21/11 NA
12 2 91 78.25 2/22/11 -10.01
All I want this to do is take the morning value and subtract it from the evening value of the previous consecutive day, within each user group. As you can see, some parts of my data frame contain NA values in the morning and evening columns; in addition, the dates are not all consecutive for each user, so naturally NA should be assigned in those cases.
I've tried searching Google, but there wasn't much information on applying functions across different rows and columns for each group of rows (if that makes any sense).
My attempts include many variations of this.
df$Difference<-ave((df$Morning,df$Evening),
df$User,
FUN=function(x){
c('NA',diff(df$Evening-df$Morning)),na.rm=T
})
Again, any help would be greatly appreciated. Thanks.
Note: the input data you show and the output data are not the same: there is an NA that is replaced by 10 in the output, and the last date is 2/14/11 in the input but 2/22/11 in the output.
I've assumed the output to be the original data when writing this answer, to match your result.
df$Diff <- c(NA, head(df$Evening, -1) - tail(df$Morning, -1))
df$Diff[which(c(0, diff(as.Date(as.character(df$Measure_Date),
                                format = "%m/%d/%y"))) != 1)] <- NA
> df
# Row User Morning Evening Measure_Date Diff
# 1 1 1 NA NA 2/18/11 NA
# 2 2 1 50 115.00 2/19/11 NA
# 3 3 1 85 128.00 2/20/11 30.00
# 4 4 1 62 NA 2/25/11 NA
# 5 5 1 48 100.80 3/8/11 NA
# 6 6 1 19 71.00 3/9/11 81.80
# 7 7 1 25 98.00 3/10/11 46.00
# 8 8 1 10 105.00 3/11/11 88.00
# 9 9 2 48 105.00 2/18/11 NA
# 10 10 2 28 203.00 2/19/11 77.00
# 11 11 2 35 80.99 2/21/11 NA
# 12 12 2 91 78.25 2/22/11 -10.01
@user1342086's edit (which was rejected, but was indeed correct):
df$Diff[which(diff(df$User) != 0)] <- NA
seems to take care of the grouping by "User".
A blind first shot (untested). Relies on the data frame being already sorted by User and Date.
# if necessary, transform your dates from factor to Date
df$Measure_Date <- as.Date(levels(df$Measure_Date)[df$Measure_Date], format = "%m/%d/%y")
df <- within(df,
  Difference <- ifelse(c(NA, diff(Measure_Date)) == 1 & c(NA, diff(User)) == 0,
                       c(NA, head(Evening, -1)) - Morning, NA)
)
I used plyr, so be sure you have it installed. This solution should work even if user data are mixed (i.e. not in consecutive rows) and dates are not in chronological order.
# Your example data, as you should post it for us to use
df <-
structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), Morning = c(NA, 50L, 85L, 62L, 48L, 19L, 25L, NA, 48L,
28L, 35L, 91L), Evening = c(NA, 115, 128, NA, 100.8, 71, 98,
105, 105, 203, 80.99, 78.25), Measure_Date = structure(c(1L,
2L, 3L, 5L, 9L, 10L, 6L, 7L, 1L, 2L, 4L, 8L), .Label = c("2/18/11",
"2/19/11", "2/20/11", "2/21/11", "2/25/11", "3/10/11", "3/11/11",
"3/14/11", "3/8/11", "3/9/11"), class = "factor")), .Names = c("User",
"Morning", "Evening", "Measure_Date"), class = "data.frame", row.names = c(NA,
-12L))
# As already stated by Arun, you need the date as class Date
df$Measure_Date <- as.Date(df$Measure_Date, format='%m/%d/%y')
# Use plyr to process the data frame by user
library(plyr)
ddply(.data = df, .variables = 'User',
      .fun = function(x){
        # Complete sequence of dates for each user
        tdf <- data.frame(Measure_Date = seq(from = min(x$Measure_Date),
                                             to = max(x$Measure_Date),
                                             by = '1 day'))
        # Merge to fill in NAs for unused dates
        tdf <- merge(tdf, x, all = TRUE)
        # Put desired values side by side (previous day's Evening)
        tdf$Evening <- c(NA, tdf$Evening[-length(tdf$Evening)])
        # Difference
        tdf$Difference <- tdf$Evening - tdf$Morning
        # Return desired values to the original data
        tdf <- tdf[, c('Measure_Date', 'Difference')]
        x <- merge(x, tdf)
        x
      })
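For reference, the same previous-day logic is compact in modern data.table with shift(); a sketch, assuming the dput sample above (this is not one of the posted answers):
library(data.table)
setDT(df)
df[, Measure_Date := as.Date(Measure_Date, format = "%m/%d/%y")]
# Previous row's Evening minus this row's Morning, kept only when the
# previous row within the same User is exactly one day earlier.
df[, Difference := fifelse(Measure_Date - shift(Measure_Date) == 1,
                           shift(Evening) - Morning, NA_real_),
   by = User]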
