Convert comma separated string to numeric columns - r

I have a dataset with several columns, one of which is a column for reaction times. These reaction times are comma separated to denote the reaction times (of the same participant) for the different trials.
For example: row 1 (i.e.: the data from participant 1) has the following under the column "reaction times"
reaction_times
2000,1450,1800,2200
Hence these are the reaction times of participant 1 for trials 1,2,3,4.
I now want to create a new data set in which the reaction times for these trials all form individual columns. This way I can calculate the mean reaction time for each trial.
trial 1 trial 2 trial 3 trial 4
participant 1: 2000 1450 1800 2200
I tried the colsplit from the reshape2 package but that doesn't seem to split my data into new columns (perhaps because my data is all in 1 cell).
Any suggestions?

I think you are looking for the strsplit() function:
a = "2000,1450,1800,2200"
strsplit(a, ",")
[[1]]
[1] "2000" "1450" "1800" "2200"
Notice that strsplit returns a list, in this case with only one element. This is because strsplit takes a vector as input and returns one list element per input string. You can therefore pass the whole column of comma-separated strings to the function and get back a list of split vectors. A more relevant example looks like this:
# Create some example data
dat = data.frame(reaction_time =
apply(matrix(round(runif(100, 1, 2000)),
25, 4), 1, paste, collapse = ","),
stringsAsFactors=FALSE)
splitdat = do.call("rbind", strsplit(dat$reaction_time, ","))
splitdat = data.frame(apply(splitdat, 2, as.numeric))
names(splitdat) = paste("trial", 1:4, sep = "")
head(splitdat)
trial1 trial2 trial3 trial4
1 597 1071 1430 997
2 614 322 1242 1140
3 1522 1679 51 1120
4 225 1988 1938 1068
5 621 623 1174 55
6 1918 1828 136 1816
And finally, to calculate the mean per person (use colMeans(splitdat) instead for the mean per trial):
apply(splitdat, 1, mean)
[1] 1187.50 361.25 963.75 1017.00 916.25 1409.50 730.00 1310.75 1133.75
[10] 851.25 914.75 881.25 889.00 1014.75 676.75 850.50 805.00 1460.00
[19] 901.00 1443.50 507.25 691.50 1090.00 833.25 669.25
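Since the question asks for the mean per trial rather than per participant, here is a minimal base R sketch of both. It assumes a hypothetical two-participant data frame called df with a reaction_times column; both names and the values are made up for illustration.

```r
# Hypothetical data: one comma-separated string per participant.
df <- data.frame(
  reaction_times = c("2000,1450,1800,2200", "1900,1500,1700,2100"),
  stringsAsFactors = FALSE
)

# Split each string, convert the pieces to numeric, and stack the rows.
parts <- strsplit(df$reaction_times, ",", fixed = TRUE)
trials <- do.call(rbind, lapply(parts, as.numeric))
colnames(trials) <- paste0("trial", seq_len(ncol(trials)))

# Column means give one mean per trial; row means give one per participant.
trial_means <- colMeans(trials)
participant_means <- rowMeans(trials)
```

This only works as written when every participant has the same number of trials; with ragged rows, rbind would recycle values silently.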

A nifty, if rather heavy-handed, way is to use read.csv in conjunction with textConnection. Assuming your data is in a data frame, df:
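# header = FALSE keeps the first participant's row as data instead of using it as column names
x <- read.csv(textConnection(df[["reaction times"]]), header = FALSE)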
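The same idea can also be written with the text argument of read.csv, which opens the text connection internally. This is a sketch on a hypothetical two-row data frame; the column name reaction_times and the trial1..trial4 labels are assumptions chosen for illustration.

```r
# Hypothetical data: one comma-separated string per participant.
df <- data.frame(
  reaction_times = c("2000,1450,1800,2200", "1900,1500,1700,2100"),
  stringsAsFactors = FALSE
)

# Each element of the character vector is read as one CSV line.
# header = FALSE stops the first participant from being used as column names.
x <- read.csv(text = df$reaction_times, header = FALSE,
              col.names = paste0("trial", 1:4))

# Mean reaction time per trial.
trial_means <- colMeans(x)
```

If the column was read in as a factor, wrap it in as.character() first.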

Old question, but I came across it from another recent question (which seems unrelated).
Both existing answers are appropriate, but I wanted to share an answer related to a package I have created called "splitstackshape" that is fast and has straightforward syntax.
Here's some sample data:
set.seed(1)
dat = data.frame(
reaction_time = apply(matrix(round(
runif(24, 1, 2000)), 6, 4), 1, paste, collapse = ","))
This is the splitting:
library(splitstackshape)
cSplit(dat, "reaction_time", ",")
# reaction_time_1 reaction_time_2 reaction_time_3 reaction_time_4
# 1: 532 1889 1374 761
# 2: 745 1322 769 1555
# 3: 1146 1259 1540 1869
# 4: 1817 125 996 425
# 5: 404 413 1436 1304
# 6: 1797 354 1984 252
And, optionally, if you need to take the rowMeans:
rowMeans(cSplit(dat, "reaction_time", ","))
# [1] 1139.00 1097.75 1453.50 840.75 889.25 1096.75

Another option using dplyr and tidyr with Paul Hiemstra's example data is:
# create example data
data = data.frame(reaction_time =
apply(matrix(round(runif(100, 1, 2000)),
25, 4), 1, paste, collapse = ","),
stringsAsFactors=FALSE)
head(data)
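# clean data
library(dplyr)
library(tidyr)
library(stringr)
data2 <- data %>%
  mutate(split_reaction_time = str_split(as.character(reaction_time), ",")) %>%
  unnest(split_reaction_time)
# label the unnested values by trial (the 4 names recycle for each participant)
data2$col_names <- c("trial1", "trial2", "trial3", "trial4")
# convert = TRUE turns the split strings back into numbers
data2 <- data2 %>%
  spread(key = col_names, value = split_reaction_time, convert = TRUE) %>%
  select(-reaction_time)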
head(data2)
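A shorter tidyr route is separate(), which splits one column into several in a single call. This is only a sketch: the two-row data frame is made up, and it assumes every row has exactly four values.

```r
library(tidyr)

# Hypothetical data: one comma-separated string per participant.
data <- data.frame(
  reaction_time = c("2000,1450,1800,2200", "1900,1500,1700,2100"),
  stringsAsFactors = FALSE
)

# Split on the comma into four named columns; convert = TRUE makes them numeric.
wide <- separate(data, reaction_time,
                 into = paste0("trial", 1:4),
                 sep = ",", convert = TRUE)

# Mean reaction time per trial.
trial_means <- colMeans(wide)
```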

Related

How to convert JSON (with uneven lists with few values missing) into dataframe in R (example given below)?

This is my JSON file, json_data, shown below. It contains lists with different numbers of fields (16 and 15). I want to create one data frame from the lists in the JSON. How do I solve this?
data2
[1] "{\"2020-01-01\":
[{\"contactsPerPageview\":0.015706806282722512,\"returningVisits\":30,\"rawViews\":191
,\"standardViews\":191,\"sessionToContactRate\":0.020689655172413793
,\"pageviewsPerSession\":1.3172413793103448,\"visits\":145,\"visitors\":115
,\"submissionsPerPageview\":0.015706806282722512,\"pageviewsMinusExits\":191
,\"submissions\":3,\"leads\":3,\"leadsPerView\":0.015706806282722512,\"contacts\":3
,\"newVisitorSessionRate\":0.7931034482758621}]
,\"2020-01-02\":[{\"contactsPerPageview\":0.007963594994311717
,\"returningVisits\":71,\"rawViews\":879,\"subscribers\":4,\"standardViews\":879
,\"sessionToContactRate\":0.012704174228675136,\"pageviewsPerSession\":1.5952813067150635
,\"visits\":551,\"visitors\":480,\"submissionsPerPageview\":0.007963594994311717
,\"pageviewsMinusExits\":879,\"submissions\":7,\"leads\":3
,\"leadsPerView\":0.0034129692832764505,\"contacts\":7
,\"newVisitorSessionRate\":0.8711433756805808}]}"
This is what I tried
df <- lapply(data2, function(play)
{ data.frame(matrix(unlist(play), ncol=16, byrow=T))
})
df <- do.call(rbind, df)
colnames(df) <- names(data2[[1]][[1]])
You can try rbindlist from data.table with the fill parameter set to TRUE.
js <- "{\"2020-01-01\":
[{\"contactsPerPageview\":0.015706806282722512,\"returningVisits\":30,\"rawViews\":191
,\"standardViews\":191,\"sessionToContactRate\":0.020689655172413793
,\"pageviewsPerSession\":1.3172413793103448,\"visits\":145,\"visitors\":115
,\"submissionsPerPageview\":0.015706806282722512,\"pageviewsMinusExits\":191
,\"submissions\":3,\"leads\":3,\"leadsPerView\":0.015706806282722512,\"contacts\":3
,\"newVisitorSessionRate\":0.7931034482758621}]
,\"2020-01-02\":[{\"contactsPerPageview\":0.007963594994311717
,\"returningVisits\":71,\"rawViews\":879,\"subscribers\":4,\"standardViews\":879
,\"sessionToContactRate\":0.012704174228675136,\"pageviewsPerSession\":1.5952813067150635
,\"visits\":551,\"visitors\":480,\"submissionsPerPageview\":0.007963594994311717
,\"pageviewsMinusExits\":879,\"submissions\":7,\"leads\":3
,\"leadsPerView\":0.0034129692832764505,\"contacts\":7
,\"newVisitorSessionRate\":0.8711433756805808}]}"
library(data.table)
a <- jsonlite::fromJSON(js)
rbindlist(a, fill = TRUE)
Gives:
contactsPerPageview returningVisits rawViews standardViews sessionToContactRate pageviewsPerSession visits visitors submissionsPerPageview pageviewsMinusExits submissions leads leadsPerView
1: 0.015706806 30 191 191 0.02068966 1.317241 145 115 0.015706806 191 3 3 0.015706806
2: 0.007963595 71 879 879 0.01270417 1.595281 551 480 0.007963595 879 7 3 0.003412969
contacts newVisitorSessionRate subscribers
1: 3 0.7931034 NA
2: 7 0.8711434 4
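The result above drops the dates that were the top-level keys of the JSON. One way to keep them is the idcol argument of rbindlist, sketched here on a shortened, made-up JSON string with the same shape (the field values and the column name date are illustrative).

```r
library(jsonlite)
library(data.table)

# Shortened stand-in for the JSON in the question: the second day has an extra field.
js <- '{"2020-01-01":[{"visits":145,"leads":3}],
        "2020-01-02":[{"visits":551,"leads":3,"subscribers":4}]}'

# fromJSON returns a named list with one data frame per date.
parsed <- fromJSON(js)

# fill = TRUE pads missing fields with NA; idcol stores the list names in a column.
out <- rbindlist(parsed, fill = TRUE, idcol = "date")
```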

Creating multiple functions with dplyr non-standard evaluation

I am trying to use mutate_() to create multiple columns where each is based on a custom function called with different inputs. I can use paste() to create multiple quoted function calls, but this doesn't work because dplyr's NSE requires formulas (~) rather than quoted strings to be able to find the function. How can I write the "dots = " line below so that the function can be found? I tried experimenting with ~, as.formula(), and lazyeval::interp(), but couldn't get any to work. My actual "prefixes" is a long vector so I don't want to separately write out the function calls for each new column. Thanks
library(dplyr)
library(lazyeval)
library(nycflights13)
myfunc = function(x, y) { x - y }
# this works
flights1 <- mutate(flights, dep_time_sched = myfunc(dep_time, dep_delay),
arr_time_sched = myfunc(arr_time, arr_delay))
# this doesn't - Error: could not find function "myfunc"
prefixes <- c('dep', 'arr')
dots = as.list(paste0('myfunc(',
paste0(prefixes, '_time'), ', ',
paste0(prefixes, '_delay)')))
flights2 <- mutate_(flights, .dots = setNames(dots, paste0(prefixes, '_time_sched')))
You could approach this by using interp with lapply to loop through your prefixes and get a list in the desired format for mutate_.
dots = lapply(prefixes, function(var) interp(~myfunc(x, y),
.values = list(x = as.name(paste0(var, "_time")),
y = as.name(paste0(var, "_delay")))))
dots
[[1]]
~myfunc(dep_time, dep_delay)
<environment: 0x0000000019e51f00>
[[2]]
~myfunc(arr_time, arr_delay)
<environment: 0x0000000019f1e5b0>
This gives the same results as your flights1.
flights2 = mutate_(flights, .dots = setNames(dots, paste0(prefixes, '_time_sched')))
identical(flights1, flights2)
[1] TRUE
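In dplyr versions where mutate_() is deprecated, a similar result can be had by building the calls with rlang and splicing them into mutate(). This is a sketch under assumptions: it uses a small made-up data frame in place of flights, and it assumes a dplyr release with tidy evaluation support plus the rlang package.

```r
library(dplyr)
library(rlang)

myfunc <- function(x, y) x - y

# Hypothetical stand-in for the flights data.
df <- data.frame(dep_time = c(517, 533), dep_delay = c(2, 4),
                 arr_time = c(830, 850), arr_delay = c(11, 20))
prefixes <- c("dep", "arr")

# Build one unevaluated call per prefix, e.g. myfunc(dep_time, dep_delay).
calls <- lapply(prefixes, function(p) {
  expr(myfunc(!!sym(paste0(p, "_time")), !!sym(paste0(p, "_delay"))))
})
names(calls) <- paste0(prefixes, "_time_sched")

# Splice the named calls into mutate(); the list names become the new column names.
res <- mutate(df, !!!calls)
```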
My actual "prefixes" is a long vector so I don't want to separately write out the function calls for each new column.
If that's the case, you should really transform your data to long format. To clarify what I mean, let's look at a smaller example:
mydat <- flights[1:5, c(paste0(prefixes,"_time"), paste0(prefixes,"_delay"))]
# dep_time arr_time dep_delay arr_delay
# (int) (int) (dbl) (dbl)
# 1 517 830 2 11
# 2 533 850 4 20
# 3 542 923 2 33
# 4 544 1004 -1 -18
# 5 554 812 -6 -25
library(data.table)
longdat <- setDT(mydat)[, .(
pref = rep(prefixes, each=.N),
time = unlist(mget(paste0(prefixes,"_time"))),
delay = unlist(mget(paste0(prefixes,"_delay")))
)]
longdat[, time_sched := myfunc(time, delay) ]
# pref time delay time_sched
# 1: dep 517 2 515
# 2: dep 533 4 529
# 3: dep 542 2 540
# 4: dep 544 -1 545
# 5: dep 554 -6 560
# 6: arr 830 11 819
# 7: arr 850 20 830
# 8: arr 923 33 890
# 9: arr 1004 -18 1022
# 10: arr 812 -25 837
Besides being simpler, calling the function a single time takes advantage of its vectorization.
While I used data.table to construct longdat, I'm sure there's a tool to do the same thing in the tidyr package (companion to dplyr). Similarly, the addition of the time_sched column is just a mutate.
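For the tidyr route mentioned above, one candidate is pivot_longer() with the special ".value" name, which splits dep_time into a prefix and a value column in one step. This is a sketch, assuming tidyr 1.0.0 or later and a two-row made-up version of mydat.

```r
library(tidyr)

myfunc <- function(x, y) x - y

# Hypothetical two-row version of mydat.
mydat <- data.frame(dep_time = c(517, 533), arr_time = c(830, 850),
                    dep_delay = c(2, 4), arr_delay = c(11, 20))

# "dep_time" is split at "_": the first part goes to pref, and the second part
# (time or delay) names the output column that receives the value.
longdat <- pivot_longer(mydat, cols = everything(),
                        names_to = c("pref", ".value"),
                        names_sep = "_")

# One vectorized call covers every prefix.
longdat$time_sched <- myfunc(longdat$time, longdat$delay)
```

Note that the rows come out interleaved by original row (dep, arr, dep, arr) rather than grouped by prefix as in the data.table output.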
Alternative ways of reshaping. Thanks to @akrun, here is another way to get to longdat, using melt function syntax that will be available in the next version of data.table (1.9.8, not released yet):
longdat <- melt(mydat,
measure = patterns('time$','delay$'),
variable.name = "pref",
value.name = c('time', 'delay')
)[, pref := prefixes[pref]]
or, also thanks to @akrun, here is a way to reshape that automatically constructs the prefixes, given the suffixes (time and delay), using @AnandaMahto's splitstackshape package:
library(splitstackshape)
longdat <- merged.stack(transform(mydat, ind=1:nrow(mydat)),
var.stubs = c('_time', '_delay'),
sep = 'var.stubs',
atStart = FALSE)

Using apply() and its corollaries to replace loops to apply outlier test (e.g.)

I have data from a behavioral task that looks something like this (assuming data frame named data):
data <- data.frame(subject = c(rep(8666, 6), rep(5452, 6)), RT = c(714, 877, 665, 854, 1092, 1960, 770, 4551, 1483, 1061, 755, 1090))
data
subject RT
8666 714
8666 877
8666 665
8666 854
8666 1092
8666 1960
5452 770
5452 4551
5452 1483
5452 1061
5452 755
5452 1090
That is, for this question, I'm working with a selection of subjects and reaction times. (All told, 183 subjects with 156 trials each.) Using reshape's cast() function, I've calculated a value for each subject that I'd like to use to exclude certain trials.
outl <- function(x) {
2.5 * mad(x) + median(x)
}
library(reshape)
melteddata <- melt(data, id.vars="subject", measure.vars = "RT")
outliers <- cast(melteddata, subject ~ ., outl)
colnames(outliers)[2] <- "outlier"
This outputs something like:
subject outlier
1 5452 2235.635
2 8666 1517.844
...
Now, the way I'd normally do this is to write a loop which, for each unique subject number, compares their RT to the outlier value for that subject:
data$outliers <- 0
for(subject in unique(data$subject)) {
temp <- data[data$subject == subject,]
temp$outliers <- ifelse(temp$RT > outliers[outliers$subject == subject,]$outlier, 1, 0)
data[data$subject == subject,]$outliers <- temp$outliers
}
... which marks the RTs of 1960 for subject 8666 and 4551 for 5452 as outliers.
However, I feel like there's got to be a more R way to do this. It feels like apply() should be able to do the same thing, and certainly this takes a long time to run as a loop. Any suggestions?
Edit:
I realize I can do this with ddply() from the plyr package instead of using melt() and cast():
library(plyr)
outliers <- ddply(data, .(subject), summarize, median = median(RT), mad = mad(RT), outlier = median(RT) + 2.5 * mad(RT))
Here's a try. Turn the outliers data frame into a named vector:
out <- outliers$outlier
names(out) <- outliers$subject
Then use it as a lookup table to select all the rows of data where the RT column is less than the outlier value for the subject:
data[data$RT < out[as.character(data$subject)], ]
The as.character is necessary since the subject IDs are integers, and you don't want to get, e.g., the 8666th element of out.
Edit to add a dplyr solution:
library(dplyr)
outliers <- group_by(data, subject) %>% summarize(outlier = 2.5 * mad(RT) + median(RT))
# keep the merged result so the outlier column is available to filter()
merged <- merge(data, outliers)
filter(merged, RT < outlier)
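A base R alternative that avoids both the loop and the separate outliers table is ave(), which applies a function within groups and returns a vector aligned with the original rows. The column names cutoff and is_outlier below are made up for this sketch; the data and outl() are the ones from the question.

```r
# Per-subject cutoff from the question: median plus 2.5 times the MAD.
outl <- function(x) 2.5 * mad(x) + median(x)

data <- data.frame(
  subject = c(rep(8666, 6), rep(5452, 6)),
  RT = c(714, 877, 665, 854, 1092, 1960, 770, 4551, 1483, 1061, 755, 1090)
)

# ave() computes outl() per subject and repeats the result for each of that subject's rows.
data$cutoff <- ave(data$RT, data$subject, FUN = outl)

# Flag trials above their subject's cutoff, then drop them.
data$is_outlier <- data$RT > data$cutoff
clean <- data[!data$is_outlier, ]
```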

How to work with all subsets in vectorized way

I have a stock price dataframe containing a lot of symbols and I would like to perform operations on subsets for every symbol in a vectorized way. My data is:
head(dataset)
date open high low close volume symbol
1 2014-08-29 34.59 34.6800 34.59 34.6800 200 AAIT
2 2014-08-28 34.96 34.9600 34.96 34.9600 211 AAIT
3 2014-08-27 35.28 35.2800 35.28 35.2800 507 AAIT
4 2014-08-26 35.02 35.0200 35.02 35.0200 00 AAIT
5 2014-08-25 34.57 35.0200 34.57 35.0200 385 AAIT
6 2014-08-22 34.80 34.8299 34.80 34.8299 802 AAIT
For every symbol I would like to do something like this:
for (symb in unique(dataset$symbol)) {
dataset$night = with(subset(dataset, dataset$symbol == symb), open[-length(open)]-close[-1])
}
This causes the last row to be filled with NA so I can't do that on the whole dataframe. I could replace the last line afterwards but I would prefer to work with the subsets for more convenience. Is it possible to do the for loop in a vectorized way? (for loops are very slow in R, which can become a problem if I have too many symbols.)
You could use dplyr:
library(dplyr)
dataset <- dataset %>%
group_by(symbol) %>%
mutate(night = c(head(open, -1) - tail(close, -1), NA))
or plyr:
library(plyr)
dataset <- ddply(dataset, .(symbol), mutate,
night = c(head(open, -1) - tail(close, -1), NA))
or data.table:
library(data.table)
dt <- data.table(dataset)
setkey(dt, symbol)
dt[, night := c(head(open, -1) - tail(close, -1), NA), by = symbol]
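The same per-symbol calculation can also be sketched in base R with ave(), passing row indices so that both open and close can be used inside each group. The prices and the helper name night_for are made up for illustration.

```r
# Hypothetical prices for two symbols, newest row first within each symbol.
dataset <- data.frame(
  symbol = c("A", "A", "A", "B", "B"),
  open   = c(10, 11, 12, 20, 21),
  close  = c(10.5, 11.5, 12.5, 19, 22),
  stringsAsFactors = FALSE
)

# For one symbol's row indices: open of each row minus close of the next row,
# padded with NA for the last row of the symbol.
night_for <- function(idx) {
  c(head(dataset$open[idx], -1) - tail(dataset$close[idx], -1), NA)
}

# ave() hands each symbol's row indices to night_for() and puts the results
# back in the original row order.
dataset$night <- ave(seq_len(nrow(dataset)), dataset$symbol, FUN = night_for)
```

As in the answers above, each symbol's last row gets NA, so nothing spills across symbols.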
