Subsetting a data frame by a value of one of its columns - r

I have a rather large data frame. Here is a simplified example:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
2 CBA 14 Good
2 FDA 14 Good
3 JHA 16 Good
3 AHF 16 Good
3 AKF 17 Good
Here it is as a dput:
dat <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Element = structure(c(1L,
2L, 5L, 6L, 7L, 8L, 3L, 4L), .Label = c("AAA", "ABA", "AHF",
"AKF", "AVA", "CBA", "FDA", "JHA"), class = "factor"), Value = c(11L,
12L, 13L, 14L, 14L, 16L, 16L, 17L), Note = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Good", class = "factor")), .Names = c("Group",
"Element", "Value", "Note"), class = "data.frame", row.names = c(NA,
-8L))
I'm trying to separate it by group, so that, say,
Group 1 becomes its own data frame:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
Group 2:
2 CBA 14 Good
2 FDA 14 Good
and so on.

You can use split for this.
> dat
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
## 6 3 JHA 16 Good
## 7 3 AHF 16 Good
## 8 3 AKF 17 Good
> x <- split(dat, dat$Group)
Then you can access each individual data frame by group number with x[[1]], x[[2]], etc.
For example, here is group 2:
> x[[2]] ## x[2] would instead return a one-element list
## Group Element Value Note
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
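As a small sketch of how the result of split can be indexed (the data frame below is a hand-typed copy of the example, with Element as character rather than factor, which is an assumption for brevity): the list elements are named after the group values, so looking a group up by name is safer than by position whenever the groups are not simply 1, 2, 3. A plain logical subset is shown for comparison.

```r
# Hand-typed copy of the example data (Element kept as character here)
dat <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Element = c("AAA", "ABA", "AVA", "CBA", "FDA", "JHA", "AHF", "AKF"),
  Value   = c(11L, 12L, 13L, 14L, 14L, 16L, 16L, 17L),
  Note    = "Good"
)

x <- split(dat, dat$Group)

# The list is named after the group values
group_names <- names(x)

# [[ ]] with a name returns the data frame for that group
g2_by_name <- x[["2"]]

# [ ] returns a list of length one that still wraps the data frame
g2_list <- x["2"]

# Without split: a plain logical subset gives the same rows
g2_direct <- dat[dat$Group == 2, ]
```

Position and name only coincide here because the groups happen to be 1, 2, 3; with groups such as 10, 20, 30, x[[2]] would be group 20 and x[["2"]] would be NULL.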
ADD: Since you asked about it in the comments, you can write each individual data frame to file with write.csv and lapply. The invisible wrapper simply suppresses the output of lapply.
> invisible(lapply(seq_along(x), function(i){
write.csv(x[[i]], file = paste0(i, ".csv"), row.names = FALSE)
}))
We can see that the files were created by looking at list.files
> list.files(pattern = "^[0-9]\\.csv$")
## [1] "1.csv" "2.csv" "3.csv"
And we can see the data frame of the third group with read.csv
> read.csv("3.csv")
## Group Element Value Note
## 1 3 JHA 16 Good
## 2 3 AHF 16 Good
## 3 3 AKF 17 Good
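A variant of the same idea, as a sketch only: name the files after the group values (names(x)) instead of the list position, and write them into a scratch directory. The directory, the "group_" prefix and the toy data are all assumptions made for the illustration.

```r
# Toy data with three groups (values are illustrative only)
dat <- data.frame(Group = c(1L, 1L, 2L, 3L), Value = c(11L, 12L, 14L, 16L))
x <- split(dat, dat$Group)

# Hypothetical output location: a fresh directory under tempdir()
out_dir <- tempfile("groups_")
dir.create(out_dir)

# One path per group, built from the group names rather than positions
paths <- file.path(out_dir, paste0("group_", names(x), ".csv"))

# Map walks over the data frames and their paths in parallel
invisible(Map(function(df, path) write.csv(df, file = path, row.names = FALSE),
              x, paths))

written <- sort(list.files(out_dir))

# Read the third group back in to check the round trip
back <- read.csv(paths[3])
```

Using the names keeps the file name tied to the group value even when the groups are not consecutive integers.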

Obligatory plyr version (pretty much equivalent to Richard's, but I'll bet it's slower, too):
library(plyr)
groups <- dlply(dat, .(Group), function(x) { return(x) })
length(groups)
## [1] 3
groups$`1` # can also do groups[[1]]
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
groups[[2]]
## Group Element Value Note
## 1 2 CBA 14 Good
## 2 2 FDA 14 Good

Related

how to remove part of a string without interrupting a data frame?

I have data that looks like this, but much bigger:
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
As an example, I am trying to remove -1 from all strings in the first column.
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
The problem is that this drops all the other columns as well.
I don't want to split the data and merge it again, because I am afraid of introducing a mismatch.
Is there any way to do this without breaking up the data frame, just getting rid of the specific strings?
For instance, the output should look like this:
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Trye 1 2
Trified 3 3
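Using gsub, with $ anchoring the match to the end of the string. The hyphen is escaped as \\-, which is harmless here, although a hyphen is only special inside a character class.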
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
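To see why the $ anchor matters, here is a small sketch on made-up strings (they are not from the question): an anchored pattern only strips a trailing -1, while an unanchored one removes every -1 it finds, including the start of -10.

```r
# Made-up strings: a normal case, a "-10" suffix, and an embedded "-1"
x <- c("bests-1", "level-10", "a-1b-1")

# Anchored: only a "-1" at the very end of the string is removed
anchored <- sub("-1$", "", x)

# Unanchored: every "-1" is removed, wherever it occurs
unanchored <- gsub("-1", "", x)

# Assigning back into the column leaves the other columns untouched
df <- data.frame(names = x, Col = 1:3)
df$names <- sub("-1$", "", df$names)
```

Here anchored is "bests", "level-10", "a-1b", whereas unanchored is "bests", "level0", "ab".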
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
Using the stringr package,
library(stringr)
df$names = str_remove_all(df$names, '-1')
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
We could use trimws from base R
df$names <- trimws(df$names, whitespace = "-\\d+")
-output
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3

Can reshape in base R turn more than one time var into a single column in long format?

I can reshape the part of my columns having the same 'name stem' opg.1 through opg.10, but when I present the last two 'time' variables, mkd.1 and mkd.2, I get the following error:
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length
In short, my question is, will renaming mkd.1 and mkd.2 to have the same opg name stem remove the error, and in case it will, why?
My code is
gdata <- termin.test[sel.cols]
names(gdata) <- c( "opg.1", "opg.2","opg.3","opg.4","opg.5", "opg.61",
"opg.62","opg.7","opg.8","opg.9","opg.10",
"navn",
"mkd.11","mkd.12" )
head(gdata)
# opg.1 opg.2 opg.3 opg.4 opg.5 opg.61 opg.62 opg.7 opg.8 opg.9 opg.10
# 1 2 2 0 0 1 0 10 4 5 10 3
# 2 0 1 0 0 2 2 5 5 2 8 1
# 3 1 0 0 0 0 0 7 3 3 7 4
# 4 0 0 0 0 0 2 7 4 8 10 7
# 5 8 2 3 4 7 3 11 12 10 8 16
# 6 1 2 1 1 2 2 5 2 2 3 6
# navn mkd.11 mkd.12
# 1 Czzzzzzz 5 24
# 2 Xxxxxx A 2 16
# 3 Cccccc B 1 17
# 4 Christian 0 26
# 5 Emil Xxxx 16 33
# 6 Aaaaa-Sss 4 11
So far so good. But here, my varying= parameter turns me down.
I wanted the variables opg.1-opg.10 and the final two mkd.11 and mkd.12.
redata <- reshape(
# The first 11 columns are per-task tick counts, plus no. 12: the student's name
gdata, # [,1:12],
direction = "long",
varying=c(1:11,13,14), # Works problem free with varying = 1:11
timevar = "opgave", #
# Vector OPGAVER is defined with task names above ????
times = opgaver
)
I have a hypothesis that it will work to rename mkd.11 -> opg.11. But I post the question because I would like to (1) get into base R and (2) comprehend what I am doing. I looked up the question What code does a task like the reshape2 package in a base reshape function? but found neither a matching problem nor answers relevant to my question.
Edit
Rephrasing the question as I need a single numerical column in the long format reshaped data frame.
The reshape function needs the "varying" argument to have balanced and consistent names. opg has 11 items and mkd has only 2.
"I need a single numerical column in the long format reshaped data frame."
Then rename the two mkd variables to opg.11 and opg.12 before reshaping (as you did).
names(gdata)[13:14] <- c("opg.11","opg.12")
reshape(gdata,
direction = "long",
varying=c(1:11,13,14),
timevar = "opgave"
) # we don't have your `opgaver` object
navn opgave opg id
1.1 Czzzzzzz 1 2 1
2.1 Xxxxxx A 1 0 2
3.1 Cccccc B 1 1 3
4.1 Christian 1 0 4
5.1 Emil Xxxx 1 8 5
6.1 Aaaaa-Sss 1 1 6
...
1.12 Czzzzzzz 12 24 1
2.12 Xxxxxx A 12 16 2
3.12 Cccccc B 12 17 3
4.12 Christian 12 26 4
5.12 Emil Xxxx 12 33 5
6.12 Aaaaa-Sss 12 11 6
If your output is a boxplot, then modify the labels in the command to draw it, or you can convert the opgave variable into a factor with the appropriate labels.
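The points above can be sketched on a tiny made-up data frame (three columns instead of thirteen; the values and the labels in the last step are assumptions for illustration). With mixed name stems, reshape guesses two value variables of unequal length and stops; with a common stem it stacks everything into one column, and the time variable can then be turned into a labelled factor.

```r
# Tiny stand-in for gdata: two opg columns, one mkd column, one id-like column
wide <- data.frame(
  opg.1 = c(2, 0, 1),
  opg.2 = c(2, 1, 0),
  mkd.3 = c(5, 2, 1),
  navn  = c("A", "B", "C")
)

# Mixed stems: "opg" has two columns and "mkd" has one, so reshape refuses.
# tryCatch stores the error message instead of stopping the script.
mixed <- tryCatch(
  reshape(wide, direction = "long",
          varying = c("opg.1", "opg.2", "mkd.3"), timevar = "opgave"),
  error = function(e) conditionMessage(e)
)

# Common stem: rename the mkd column, then reshape into a single value column
names(wide)[names(wide) == "mkd.3"] <- "opg.3"
long <- reshape(wide, direction = "long",
                varying = c("opg.1", "opg.2", "opg.3"), timevar = "opgave")

# Optional: restore readable labels for plotting (labels are hypothetical)
long$opgave <- factor(long$opgave, levels = 1:3,
                      labels = c("opg 1", "opg 2", "mkd 3"))
```

The rename only changes how reshape groups the columns; the factor labels are what a boxplot axis would show, so the original distinction between opg and mkd can still be displayed.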
I can see that when I use the suggestion of @akrun, I get two columns, opg and mkd, in the reshaped data frame. As there are 11 opg columns and only 2 mkd columns, the reason for the mentioned error message is evident. In my data set, I end up with
> melt(setDT(gdata), measure = patterns('^opg\\.\\d+$', '^mkd\\.\\d+$'),
+ value.name = c('opg', 'mkg'), variable.name = 'opgave')
# navn opgave opg mkg
# 1: Czzzzzzz 1 2 5
# 2: Caroline Cxxxx 1 0 2
# 3: Crrrrrrr Rrrrr 1 1 1
# 4: Christian 1 0 0
# 5: Emil Zzzz Cccc 1 8 16
# ---
#238: Owiler 11 8 NA
#239: Sarah 11 5 NA
#240: Bang Bang 11 10 NA
#241: Thhhhh 11 2 NA
#242: William B 11 6 NA
The NA values in the mkg column show that there are fewer variables of this type. This is not as intended. Therefore I stick to the same-name-stem option:
gdata <- termin.test[sel.cols]
names(gdata) <- c( "opg.1", "opg.2","opg.3","opg.4","opg.5", "opg.61",
"opg.62","opg.7","opg.8","opg.9","opg.10",
"navn",
"opg.11","opg.12" )
redata <- reshape(
# The first 11 columns are per-task tick counts, plus no. 12: the student's name
gdata, # [,1:12],
direction = "long",
varying=c(1:11,13,14), # The first 11 columns are to be "flipped"
timevar = "opgave", #
# Vector OPGAVER is defined with task names above ????
times = opgaver
)
This solution works in my further processing, where the diagram is drawn with geom_boxplot(). I can live with the names of the two latter columns; renaming them in the factored variable opgave is beyond the scope of this question.
If we want to rename the 'mkd' to 'opg'
library(ggplot2)
library(stringr)
library(dplyr)
library(tidyr)
gdata %>%
rename_at(vars(starts_with('mkd')), ~ str_replace(., 'mkd', 'opg')) %>%
pivot_longer(cols = -navn, names_to = 'opgave', values_to = 'value') %>%
ggplot(aes(x =opgave, y = value)) +
geom_boxplot()
data
gdata <- structure(list(opg.1 = c(2L, 0L, 1L, 0L, 8L, 1L), opg.2 = c(2L,
1L, 0L, 0L, 2L, 2L), opg.3 = c(0L, 0L, 0L, 0L, 3L, 1L), opg.4 = c(0L,
0L, 0L, 0L, 4L, 1L), opg.5 = c(1L, 2L, 0L, 0L, 7L, 2L), opg.61 = c(0L,
2L, 0L, 2L, 3L, 2L), opg.62 = c(10L, 5L, 7L, 7L, 11L, 5L), opg.7 = c(4L,
5L, 3L, 4L, 12L, 2L), opg.8 = c(5L, 2L, 3L, 8L, 10L, 2L), opg.9 = c(10L,
8L, 7L, 10L, 8L, 3L), opg.10 = c(3L, 1L, 4L, 7L, 16L, 6L), navn = c("Czzzzzzz",
"Xxxxxx A", "Cccccc B", "Christian", "Emil Xxxx", "Aaaaa-Sss"
), mkd.11 = c(5L, 2L, 1L, 0L, 16L, 4L), mkd.12 = c(24L, 16L,
17L, 26L, 33L, 11L)), class = "data.frame", row.names = c(NA,
-6L))

R replace the column name by the dataframe name with a loop

I am very new to programming with R, but I am trying to replace the column name by the dataframe name with a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. And every dataframe has 2 columns: Date and Close (closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") by the name of the dataframe ("BTC.USD")
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
aux <- get(i)
(colnames(aux)[2] =df_list)
assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
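Applied to the situation in the question, the same idea might look like the sketch below. The two toy data frames and their names are stand-ins for the 25 real ones (an assumption); mget collects the objects into a named list so that Map can pair each data frame with its own name. The loop at the end shows what seems to be the one missing step in the original attempt: assigning i (the current name) rather than df_list (the whole vector of names).

```r
# Toy stand-ins for the real data frames (names and values are made up)
AAA.USD <- data.frame(Date = as.Date("2016-01-01") + 0:2, Close = c(1, 2, 3))
BBB.USD <- data.frame(Date = as.Date("2016-01-01") + 0:2, Close = c(10, 20, 30))

# In the question this vector would come from ls(pattern = "USD")
df_names <- c("AAA.USD", "BBB.USD")

# Collect the objects into a named list
dfs <- mget(df_names)

# List version: rename the second column of each element after the element
renamed <- Map(function(df, nm) {
  names(df)[2] <- nm
  df
}, dfs, names(dfs))

# Loop version: modifies the original objects in place
for (i in df_names) {
  aux <- get(i)
  colnames(aux)[2] <- i   # i is the current name; df_list was the whole vector
  assign(i, aux)
}
```

Keeping the data frames in a list (renamed) is usually easier to work with afterwards, for example when merging them by Date.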

reshaping a dataframe for prediction

I just picked up the package reshape today and I'm having some trouble understanding how it works.
I have the following dataframe:
name workoutnum time weight raceid final_position
tommy 1 12 140 1 2
tommy 2 14 140 1 2
tommy 3 11 140 1 2
sarah 1 10 115 1 1
sarah 2 10 115 1 1
sarah 3 11 115 1 1
sarah 4 15 115 1 1
How would I put all this in one row? So the dataframe would look like:
name workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
tommy 1 1 1 0 12 14 11 NA 140 1 2
sarah 1 1 1 1 10 10 11 15 115 1 1
So all columns would be attached to the workout values.
Is this even the proper way to do it?
reshape seems like a natural part of what you want to do, but won't get you all the way there.
Here's a reshape2 approach that fully melts the data, then casts it back to data.frame, with some tweaks along the way to get the desired output.
Note that in the call to melt(), the variables in the id.vars arguments will remain wide. Then in dcast(), the variable that'll be cast wide is on the RHS of the ~.
library(reshape2)
library(dplyr)
# fully melt the data
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
group_by(name, variable) %>%
mutate(i = row_number(),
wide_variable = paste0(variable, i))
# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
# replace the workoutnum indices with indicators for missingness
# (across() is used because mutate_each() and funs() are deprecated in current dplyr)
d_wide %>% mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))
# name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah 1 1 115 10 10 11 15 1 1
# 2 tommy 1 2 140 12 14 11 NA 1 1
# workoutnum3 workoutnum4
# 1 1 1
# 2 1 0
Data:
d <- structure(list(name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"), workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L), weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L), raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"), class = "data.frame", row.names = c(NA, -7L))
Here's an approach using dcast from "data.table", which reshapes a little more like the reshape function in base R.
The only change I've made to the data is the inclusion of another "time" variable though, as pointed out by @rawr in the comments, it almost seems like your "workoutnum" is the time variable.
I've used getanID from my "splitstackshape" package to generate the "time" variable, but you can create this variable in many different ways.
library(splitstackshape)
dcast(getanID(mydf, c("name", "raceid", "final_position")),
name + raceid + final_position ~ .id,
value.var = c("workoutnum", "time", "weight"))
## name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah 1 1 1 2 3
## 2: tommy 1 2 1 2 3
## workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1: 4 10 10 11 15 115 115 115 115
## 2: NA 12 14 11 NA 140 140 140 NA
If you're using getanID, you can also use reshape like this:
reshape(getanID(mydf, c("name", "raceid", "final_position")),
idvar = c("name", "raceid", "final_position"), timevar = ".id",
direction = "wide")
## name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy 1 2 1 12 140 2 14
## 2: sarah 1 1 1 10 115 2 10
## weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1: 140 3 11 140 NA NA NA
## 2: 115 3 11 115 4 15 115
but dcast would be more efficient in general.
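For completeness, a base-R-only sketch of the same reshaping, assuming the data are typed in as below and that weight is constant within a person (so it can act as an id variable). Here ave() with seq_along stands in for getanID to number the rows within each name, and the last step turns the workoutnum columns into 0/1 indicators as in the desired output.

```r
# Hand-typed copy of the example data
mydf <- data.frame(
  name           = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum     = c(1, 2, 3, 1, 2, 3, 4),
  time           = c(12, 14, 11, 10, 10, 11, 15),
  weight         = c(140, 140, 140, 115, 115, 115, 115),
  raceid         = 1,
  final_position = c(2, 2, 2, 1, 1, 1, 1)
)

# Running row number within each name (plays the role of getanID's .id)
mydf$.id <- ave(seq_along(mydf$name), mydf$name, FUN = seq_along)

# Everything that is constant per person goes into idvar;
# workoutnum and time are spread out by .id
wide <- reshape(mydf,
                idvar = c("name", "raceid", "final_position", "weight"),
                timevar = ".id", direction = "wide")

# Turn the workoutnum.* columns into 1/0 "workout happened" indicators
wn_cols <- paste0("workoutnum.", 1:4)
wide[wn_cols] <- lapply(wide[wn_cols], function(v) as.integer(!is.na(v)))
```

Because weight is listed in idvar it stays as a single column instead of being repeated as weight.1 to weight.4.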

Remove duplicated 2 columns permutations

I can't find a good title for this question so feel free to edit it please.
I have this data.frame
section time to from
1 a 9 1 2
2 a 9 2 1
3 a 12 2 3
4 a 12 2 4
5 a 12 3 2
6 a 12 3 4
7 a 12 4 2
8 a 12 4 3
I want to remove rows that duplicate the same to/from pair regardless of order, e.g. (1,2) and (2,1) count as duplicates.
So final output would be:
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
I have a solution by constructing a new column key e.g
key <- paste(pmin(to, from), pmax(to, from))
and removing duplicated keys using duplicated, but I think this is a dirty solution.
Here is the dput of my data:
structure(list(section = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "a", class = "factor"), time = c(9L, 9L, 12L,
12L, 12L, 12L, 12L, 12L), to = c(1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L), from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L)), .Names = c("section",
"time", "to", "from"), row.names = c(NA, -8L), class = "data.frame")
mn <- pmin(s$to, s$from)
mx <- pmax(s$to, s$from)
int <- as.numeric(interaction(mn, mx))
s[match(unique(int), int),]
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
Credit for the idea goes to this question: Remove consecutive duplicates from dataframe, and specifically @MatthewPlourde's answer.
You can try using sort within the apply function to order the combinations.
mydf[!duplicated(t(apply(mydf[3:4], 1, sort))), ]
# section time to from
# 1 a 9 1 2
# 3 a 12 2 3
# 4 a 12 2 4
# 6 a 12 3 4
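Another way to write the same idea, as a sketch: build a two-column key with pmin and pmax and hand it straight to duplicated(), which avoids both the pasted string key and the row-wise apply. The data frame is a hand-typed copy of the example; the name s and the key column names lo/hi are chosen here for illustration.

```r
# Hand-typed copy of the example data
s <- data.frame(
  section = "a",
  time    = c(9, 9, 12, 12, 12, 12, 12, 12),
  to      = c(1, 2, 2, 2, 3, 3, 4, 4),
  from    = c(2, 1, 3, 4, 2, 4, 2, 3)
)

# Order-independent key: smaller endpoint first, larger endpoint second
key <- data.frame(lo = pmin(s$to, s$from), hi = pmax(s$to, s$from))

# duplicated() on a data frame compares whole rows, so (1,2) and (2,1)
# collapse to the same key and only the first occurrence is kept
res <- s[!duplicated(key), ]
```

This keeps rows 1, 3, 4 and 6. If pairs should only count as duplicates within the same section and time, those columns can be added to key as well.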
