I am trying to sample 2 columns of a data frame, but sample() only lets me sample one column at a time, not both columns (Campaignid, CampaignName) together.
Is there a way to sample whole rows so that the two columns stay paired?
camp.d <- data.frame(Campaignid=c(121,132,133,143,153),
CampaignName=c('a','b','c','d','e'))
#allows only one column
a <- sample(camp.d$Campaignid, 100, replace = TRUE)
Expected:
Campaignid CampaignName
121 a
121 a
133 c
132 b
132 b
...
I think you need this: sample row indices, then use them to subset the whole data frame.
sampled_data <- camp.d[sample(nrow(camp.d), 100, replace = TRUE), ]
head(sampled_data)
Campaignid CampaignName
2 132 b
5 153 e
3 133 c
3.1 133 c
2.1 132 b
4 143 d
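A minimal sketch of the row-index approach above, with a check that each Campaignid still carries its original CampaignName. The seed and the names sampled, lookup, and pairs_ok are only illustrative and not part of the original answer.

```r
camp.d <- data.frame(Campaignid = c(121, 132, 133, 143, 153),
                     CampaignName = c("a", "b", "c", "d", "e"),
                     stringsAsFactors = FALSE)

set.seed(42)  # arbitrary seed, only so the draw is reproducible

# Sample row positions (not column values), then pull whole rows
idx <- sample(nrow(camp.d), 100, replace = TRUE)
sampled <- camp.d[idx, ]
rownames(sampled) <- NULL  # drop the 3.1, 2.1, ... style row names

# Check that every sampled id still has its original name
lookup <- setNames(camp.d$CampaignName, camp.d$Campaignid)
pairs_ok <- all(lookup[as.character(sampled$Campaignid)] == sampled$CampaignName)
```

Because the rows are picked by position, both columns always move together.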
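You could use the sample call to slice the full data frame. Note that the sampled values must be row positions rather than Campaignid values, and that replace = TRUE is needed to draw 100 rows from 5:
camp.d[sample(seq_len(nrow(camp.d)), 100, replace = TRUE), ]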
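You can try the following, but note that it samples each column independently, so Campaignid and CampaignName do not stay paired (see 132/a in the output below):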
as.data.frame(lapply(camp.d, sample, size = 100, replace = TRUE))
Campaignid CampaignName
1 132 a
2 133 c
3 143 a
4 132 e
5 133 c
6 143 a
7 132 c
8 153 a
9 121 c
10 132 b
Instead of looking at the first n rows of a data frame, as head(mydf) does, or the last n, as tail(mydf) does, it occurs to me that I would often rather see n evenly spaced rows, including the first and the last row. For example, if a data frame had 601 rows, this hypothetical function would display rows 1, 121, 241, 361, 481, and 601, assuming that 6 is the default number, as it is for head() and tail().
Is there a built-in function or a package that does this, and if not, what would be the best way to implement it?
For example, for the data frame mydf <- data.frame(name=letters, value=101:126), I would want the output of an alternative to head() called myview() to be something like:
> myview(mydf)
name value
1 a 101
6 f 106
11 k 111
16 p 116
21 u 121
26 z 126
You can do this directly with seq() and its length.out argument:
looksee <- function(df, n = 6) df[seq(1, nrow(df), length.out = n),]
looksee(mydf)
# name value
#1 a 101
#6 f 106
#11 k 111
#16 p 116
#21 u 121
#26 z 126
looksee(mydf, 10)
# name value
#1 a 101
#3 c 103
#6 f 106
#9 i 109
#12 l 112
#14 n 114
#17 q 117
#20 t 120
#23 w 123
#26 z 126
This is my attempt at an implementation, but it is probably not as robust as head(); for one thing, it only works for objects that nrow() works for.
looksee <- function(df, n = 6) {
  # n evenly spaced probabilities from 0 to 1
  q <- seq(0, 1, length.out = n)
  total <- nrow(df)
  # map the probabilities onto row positions 1..total
  rows <- round(quantile(1:total, probs = q))
  return(df[rows, ])
}
Example usage:
> mydf <- data.frame(name=letters, value=101:126)
> looksee(mydf)
name value
1 a 101
6 f 106
11 k 111
16 p 116
21 u 121
26 z 126
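The seq() answer above relies on R truncating fractional indices when subsetting. As a hedged variation, not taken from either answer, the sketch below rounds the positions explicitly and caps n at the number of rows, so asking for more rows than exist does not return duplicates. The name myview comes from the question. The guard for empty data frames is an added assumption about the desired behaviour.

```r
myview <- function(df, n = 6L) {
  total <- nrow(df)
  if (total == 0L) return(df)  # nothing to show for an empty data frame
  # Evenly spaced positions from the first to the last row, rounded to
  # whole numbers; unique() guards against repeated positions
  rows <- unique(round(seq(1, total, length.out = min(n, total))))
  df[rows, , drop = FALSE]
}

mydf <- data.frame(name = letters, value = 101:126)
picked <- myview(mydf)        # rows 1, 6, 11, 16, 21, 26
everything <- myview(mydf, 100)  # capped at the 26 available rows
```

Rounding rather than truncating can pick slightly different middle rows than the seq() one-liner when the spacing is fractional. The first and last rows are always included when n is at least 2.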
y <- data.frame(x = c("63,98,131","75,109,145","66,104,139"))
I want to make three columns A, B, C from x by splitting on the comma:
A B C
63 98 131
75 109 145
66 104 139
I tried to use str_split
str_split(y$x, " , ")
[[1]]
[1] "63,98,131"
[[2]]
[1] "75,109,145"
[[3]]
[1] "66,104,139"
But this does not do the job. How can I fix it?
> dt <- as.data.frame(matrix(unlist(strsplit(as.character(y$x), ",")), ncol = 3, byrow = TRUE))
> dt
V1 V2 V3
1 63 98 131
2 75 109 145
3 66 104 139
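The str_split attempt in the question returns the strings unsplit because its pattern " , " includes spaces that never occur in the data. A minimal base R sketch that splits on the bare comma and also yields numeric columns named A, B, C is below. The intermediate names parts and abc are illustrative only.

```r
y <- data.frame(x = c("63,98,131", "75,109,145", "66,104,139"),
                stringsAsFactors = FALSE)

# Split each string on a literal comma (no surrounding spaces)
parts <- strsplit(as.character(y$x), ",", fixed = TRUE)

# Convert each piece to numeric and stack the pieces row by row
abc <- as.data.frame(do.call(rbind, lapply(parts, as.numeric)))
names(abc) <- c("A", "B", "C")
```

This assumes every string contains exactly three comma-separated numbers. Rows with a different number of fields would need separate handling.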
I have a simple line of code for a matrix:
ind1=which(macierz==1,arr.ind = TRUE)
A fragment of the result is:
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and the first column is the rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
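A small sketch of the row-name approach, using a hand-built matrix that mimics part of the which(..., arr.ind = TRUE) output from the question. The values are copied from the fragment shown. Using the col column as a tie-breaker is an added assumption.

```r
ind1 <- matrix(
  c(53, 66, 322, 53, 66, 322,   # "row" column
    1, 1, 1, 2, 2, 2),          # "col" column
  ncol = 2,
  dimnames = list(
    c("TCGA.CH.5737.01", "TCGA.CH.5791.01", "P03.1334.Tumor",
      "TCGA.CH.5737.01", "TCGA.CH.5791.01", "P03.1334.Tumor"),
    c("row", "col")
  )
)

# Order by row name, breaking ties by the "col" column;
# drop = FALSE keeps the result a matrix even if one row remains
sorted <- ind1[order(rownames(ind1), ind1[, "col"]), , drop = FALSE]
```

The row names travel with the rows, so the sorted matrix keeps its labels.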
Assuming your first column is called "label" and is a regular column rather than row names, you need:
ind1[order(ind1$label),]
order() returns a vector of row indices that puts the data frame in alphabetical order by that column. To make the example reproducible, I recreated your data frame like this:
ind1 <- data.frame(
  label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01", "P03.1334.Tumor",
            "P04.1790.Tumor", "CPCG0340.F1", "TCGA.CH.5737.01",
            "TCGA.CH.5791.01", "P03.1334.Tumor", "P04.1790.Tumor",
            "CPCG0340.F1"),
  row = c(53, 66, 322, 327, 425, 53, 66, 322, 327, 425),
  col = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto
I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute, for each Test_No, the number of unique Category values and the median Category value. I chose to use split and sapply as shown below, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? My code is below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or, if you prefer base R, we can also try aggregate (applied here to the Test data frame from the question):
aggregate(Category ~ Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
          Median_Cat = median(x, na.rm = TRUE), Category = toString(x)))
As for the function you wrote, there are some syntax issues: the function is never assigned to a name, and the function(ModRange) passed to sapply has no body, which is likely what the parser is complaining about. A corrected version:
new_func <- function(CatRange){
  c(Cat_Count = length(unique(CatRange$Category)),
    Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
    Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
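Because c() coerces everything to character once a string is included, the sapply version above returns Cat_Count and Median_Cat as text. A hedged alternative sketch is to return a one-row data frame per group and bind the pieces, which keeps the numeric columns numeric. The helper name cat_stats is hypothetical. Only the Test_No and Category columns from the question's data are reproduced here.

```r
Test <- data.frame(
  Test_No  = c(1, 1, 2, 3, 4, 5, 4, 5, 7, 9, 9, 9, 2),
  Category = c(3, 4, 5, 6, 237, 56, 354, 726, 800, 125, 1005, 28, 76)
)

# One summary row per group; data.frame() keeps each column's own type
cat_stats <- function(grp) {
  data.frame(
    Test_No    = grp$Test_No[1],
    Cat_Count  = length(unique(grp$Category)),
    Median_Cat = median(grp$Category, na.rm = TRUE),
    Category   = toString(grp$Category),
    stringsAsFactors = FALSE
  )
}

# split() by Test_No, summarise each piece, then stack the rows
CatStat <- do.call(rbind, lapply(split(Test, Test$Test_No), cat_stats))
```

The result has one row per Test_No with the four requested fields: Test_No, Category, Median_Cat, and Cat_Count.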
I'm attempting to add a column to a data frame that consists of normalized values by a factor.
For example:
'data.frame': 261 obs. of 3 variables:
$ Area : Factor w/ 29 levels "Antrim","Ards",..: 1 1 1 1 1 1 1 1 1 2 ...
$ Year : Factor w/ 9 levels "2002","2003",..: 1 2 3 4 5 6 7 8 9 1 ...
$ Arrests: int 18 54 47 70 62 85 96 123 99 38 ...
I'd like to add a column containing the Arrests values normalized within each Area group.
The best I've come up with is:
data$Arrests.norm <- unlist(unname(by(data$Arrests,data$Area,function(x){ scale(x)[,1] } )))
This command runs, but the data is scrambled, i.e., the normalized values don't match the correct Areas in the data frame.
I'd appreciate any tips.
EDIT: Just to clarify what I mean by scrambled data: subsetting the data frame after running my code gives output like the following, where the normalized values clearly belong to another factor group.
Area Year Arrests Arrests.norm
199 Larne 2002 92 -0.992843957
200 Larne 2003 124 -0.404975825
201 Larne 2004 89 -1.169204397
202 Larne 2005 94 -0.581336264
203 Larne 2006 98 -0.228615385
204 Larne 2007 8 0.006531868
205 Larne 2008 31 0.418039561
206 Larne 2009 25 0.947120880
207 Larne 2010 22 2.005283518
Following up on your by() attempt:
df <- data.frame(A = factor(rep(c("a", "b"), each = 4)),
                 B = sample(1:4, 8, TRUE))
ll <- by(data = df, df$A, function(x) {
  x$B_scale <- scale(x$B)
  x
})
df2 <- do.call(rbind, ll)
data <- transform(data, Arrests.norm = ave(Arrests, Area, FUN = scale))
will do the trick.
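ave() returns a vector the same length as its input, with each result placed at the position of the row it came from, so the values cannot end up attached to the wrong Area. The sketch below checks this on a small made-up data frame with interleaved areas. The numbers and the helper zscore are illustrative assumptions, with zscore standing in for scale().

```r
arrests <- data.frame(
  Area    = factor(c("Antrim", "Ards", "Antrim", "Ards", "Antrim", "Ards")),
  Arrests = c(18, 38, 54, 40, 47, 45)
)

# Plain z-score: same centring and scaling as scale(), but returns a vector
zscore <- function(x) (x - mean(x)) / sd(x)

# ave() applies zscore within each Area and keeps the original row order
arrests$Arrests.norm <- ave(arrests$Arrests, arrests$Area, FUN = zscore)

# Within each Area the normalized values should have mean 0 and sd 1
group_means <- tapply(arrests$Arrests.norm, arrests$Area, mean)
group_sds   <- tapply(arrests$Arrests.norm, arrests$Area, sd)
```

Unlike unlist(by(...)), which returns the values grouped in factor-level order, this stays correct even when the rows are not sorted by Area.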