R "melt-cast"-like operation

I have a file that contains content like this:
name: erik
age: 7
score: 10
name: stan
age:8
score: 11
name: kyle
age: 9
score: 20
...
As you can see, each record actually spans 3 rows in the file. I am wondering how I can read in the file and transform it into a data frame that looks like this:
name age score
erik 7 10
stan 8 11
kyle 9 20
...
What I have done so far (thanks tcash21):
> data <- read.table(file.choose(), header=FALSE, sep=":", col.names=c("variable", "value"))
> data
variable value
1 name erik
2 age 7
3 score 10
4 name stan
5 age 8
6 score 11
7 name kyle
8 age 9
9 score 20
I am wondering how I can split the column into two columns by ":" and then maybe use something similar to cast from the reshape package to do what I want.
Alternatively, how can I select only the rows with index 1, 4, 7, ..., i.e. every third row?
Thanks!

Another possibility:
library(reshape2)
df$id <- rep(1:(nrow(df)/3), each = 3)
dcast(df, id ~ variable, value.var = "value")
# id age name score
# 1 1 7 erik 10
# 2 2 8 stan 11
# 3 3 9 kyle 20

If the format is predictable you might want to do something really simple like
# recreate data
data <- matrix(c("erik",7,10,"stan",8,11,"kyle",9,20), ncol=1)
# get individual variables
names <- data[seq(1,length(data)-2,3)]
age <- data[seq(2,length(data)-1,3)]
score <- data[seq(3,length(data),3)]
# combine variables
reformatted.data <- as.data.frame(cbind(names,age,score))
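If you prefer to avoid reshape2 entirely, a minimal base-R sketch (assuming the two-column data frame read in as in the question, and that every record is exactly three rows) folds the value column into a three-wide matrix:

```r
# recreate the two-column data frame from the question's read.table call
data <- data.frame(variable = rep(c("name", "age", "score"), 3),
                   value = c("erik", "7", "10", "stan", "8", "11", "kyle", "9", "20"),
                   stringsAsFactors = FALSE)
# each record is 3 consecutive rows, so fill a 3-column matrix by row
wide <- as.data.frame(matrix(data$value, ncol = 3, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- c("name", "age", "score")
# everything came in as character, so convert the numeric columns
wide$age <- as.numeric(wide$age)
wide$score <- as.numeric(wide$score)
wide
```

This only works because the file is strictly regular; if a field can ever be missing, the id + dcast approach above is safer.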

Related

Find sample size based on number of rows in data

I have a dataset that looks like this:
Region      Name
Region 1    Name 14
Region 2    Name 18
Region 2    Name 2
Region 2    Name 21
Region 2    Name 44
Region 3    Name 64
Region 3    Name 24
Region 4    Name 1
Region 4    Name 1
Region 4    Name 98
Region 5    Name 98
Region 5    Name 8
Region 5    Name 8
Region 5    Name 8
Region 5    Name 98
I need to break up the data by Region and then select a random sample of 5% of the Names per Region, based on the number of rows in that Region.
So let's say there are 30 Names in Region 2; then I need a random sample of 30*.05 rows. If there are 50 Names in Region 6, then I need a random sample of 50*.05 rows.
So far, I've been able to split() the data using
d = split(data, f = data$Region)
but when I try to run an lapply function, I get an error that there are a different number of rows in the list that split() provided:
lapply(data, function(x) {
sample_n(data, nrow(d)*.05)
} )
Any thoughts?
Thank you
Here's a base R solution.
lapply(split(data, data$Region),
\(x) x[sample(nrow(x), nrow(x) * 0.05),])
You can then combine the pieces back into a single data frame with do.call(rbind, ...).
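Putting that together on hypothetical Region/Name data (with ceiling() added so small regions still return at least one row):

```r
set.seed(1)
# hypothetical data: three regions of different sizes
data <- data.frame(Region = rep(c("Region 1", "Region 2", "Region 3"),
                                times = c(40, 60, 100)),
                   Name = paste("Name", 1:200))
# 5% sample within each region, rounded up so no region returns zero rows
sampled <- lapply(split(data, data$Region),
                  function(x) x[sample(nrow(x), ceiling(nrow(x) * 0.05)), ])
# stitch the per-region samples back into one data frame
result <- do.call(rbind, sampled)
```

Note that the asker's original lapply failed because it referred to `data` and `d` inside the function instead of the per-group argument `x`.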

Multiply various subsets of a data frame by different elements of a vector in R

I have a data frame:
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
x*vals[i]
}
for (i in vals) {
df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting df (df2), each Room column is multiplied by the same element of vals for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a small lookup table of id/value pairs, merge it onto the Room data, and then use mutate to overwrite the Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data, as it makes it easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
other_df <- data.frame(id = 1:10,
                       val = vals)
df2 <- inner_join(other_df, df)
df2 <- df2 %>%
mutate(Room1 = Room1*val,
Room2 = Room2*val,
Room3 = Room3*val)
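For the record, the same idea also works in one step in base R: vals[df$id] lines up one multiplier per row, and multiplying a data frame by a vector of matching length recycles it element-wise down each column. A sketch under the same seed:

```r
set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)
df2 <- df
# vals[df$id] gives one multiplier per row; each Room column is
# multiplied element-wise by that row-aligned vector
df2[c("Room1", "Room2", "Room3")] <- df[c("Room1", "Room2", "Room3")] * vals[df$id]
```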

How to filter a data frame by a vector whose elements are only partly present in the data frame

I have a large data set that contains a lot of information about departure times at bus stops. The main data set contains Trip_ID, Bus_sign, and stop_ID. I also have an index vector that I would like to filter the df by.
df <- data.frame(c(10,10,10,10,10,10,10,10,10,10),
c(8,10,12,15,22,26,27,40,45,50),
c("0000001","0000002","0000003","0000004","0000005","0000006","0000007", "0000008","0000009","0000010"))
names <- c("trip_ID", "Bus_sign", "stop_ID")
colnames(df) <- names
index <- c("0000001", "0000002", "0000003", "0000011","00000013")
The data frame would look something like this:
trip_ID Bus_sign stop_ID
1 10 8 0000001
2 10 10 0000002
3 10 12 0000003
4 10 15 0000004
5 10 22 0000005
6 10 26 0000006
7 10 27 0000007
8 10 40 0000008
9 10 45 0000009
10 10 50 0000010
The index contains some of the stop_ID values in df, but it also contains some that are not in df. I would like to keep only the rows of df whose stop_ID matches the index.
The result should look like this:
trip_ID Bus_sign stop_ID
1 10 8 0000001
2 10 10 0000002
3 10 12 0000003
I have tried the subset function, but it wouldn't work:
subset(df, stop_ID %in% index)
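For what it's worth, subset() with %in% should do exactly this on the data above; a self-contained sketch (keeping stop_ID as character so the leading zeros survive):

```r
df <- data.frame(trip_ID = rep(10, 10),
                 Bus_sign = c(8, 10, 12, 15, 22, 26, 27, 40, 45, 50),
                 stop_ID = sprintf("%07d", 1:10),  # "0000001" .. "0000010"
                 stringsAsFactors = FALSE)
index <- c("0000001", "0000002", "0000003", "0000011", "00000013")
# keep only the rows whose stop_ID appears in the index
result <- subset(df, stop_ID %in% index)
result
```

If this returns nothing on the real data, a likely culprit is a type mismatch, e.g. stop_ID stored as numeric (dropping the leading zeros) while the index is character.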

Unable to set column names to a subset of a dataframe

I run the following code; p is the loaded dataframe.
a <- sort(table(p$Title))
a1 <- as.data.frame(a)
tail(a1, 7)
a
Maths 732
Science 737
Physics 737
Chemistry 776
Social Science 905
null 57374
88117
I want to do some manipulations on the above dataframe result. I want to add column names to the dataframe. I tried the colnames function.
colnames(a1) <- c("category", "count")
I get the below error:
Error in `colnames<-`(`*tmp*`, value = c("category", "count")) :
attempt to set 'colnames' on an object with less than two dimensions
Please suggest.
As I said in the comments to your question, the categories are rownames. A reproducible example:
# create dataframe p
x <- c("Maths","Science","Physics","Chemistry","Social Science","Languages","Economics","History")
set.seed(1)
p <- data.frame(title=sample(x, 100, replace=TRUE), y="some arbitrary value")
# create the data.frame as you did
a <- sort(table(p$title))
a1 <- as.data.frame(a)
The resulting dataframe:
> a1
a
Social Science 6
Maths 9
History 10
Science 11
Physics 12
Languages 15
Economics 17
Chemistry 20
Looking at the dimensions of dataframe a1, you get this:
> dim(a1)
[1] 8 1
which means that your dataframe has 8 rows and 1 column. Trying to assign two column names to the a1 dataframe will hence result in an error.
You can solve your problem in two ways:
1: assign just one column name with colnames(a1) <- c("count")
2: convert the rownames to a category column and then assign the column names:
a1$category <- row.names(a1)
colnames(a1) <- c("count","category")
The resulting dataframe:
> a1
count category
Social Science 6 Social Science
Maths 9 Maths
History 10 History
Science 11 Science
Physics 12 Physics
Languages 15 Languages
Economics 17 Economics
Chemistry 20 Chemistry
You can remove the rownames with rownames(a1) <- NULL. This gives:
> a1
count category
1 6 Social Science
2 9 Maths
3 10 History
4 11 Science
5 12 Physics
6 15 Languages
7 17 Economics
8 20 Chemistry

Adding counts of a factor to a dataframe [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed 6 years ago.
I have a data frame where each row is an observation concerning a pupil. One of the vectors in the data frame is an id for the school. I have obtained a new vector with counts for each school as follows:
tbsch <- table(dt$school)
Now I want to add the relevant count value to each row in dt. I have done it with a for() loop through each row of dt, building a new vector of the relevant counts and finally using cbind() to add it to dt, but I think this is very inefficient. Is there a smart/easy way to do that?
Using jmsigner's data you could do:
dt$count <- ave(dt$school, dt$school, FUN = length)
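A tiny self-contained illustration of that ave() call (hypothetical school ids); note that ave() returns something of the same type as its first argument, so for a character school column you would pass seq_along(dt$school) instead:

```r
dt <- data.frame(school = c(1, 2, 1, 3, 2, 1))
# for each row, replace the school id with the size of its group
dt$count <- ave(dt$school, dt$school, FUN = length)
dt$count  # 3 2 3 1 2 3
```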
This is a lot easier in data.table v1.8.1. := now works by group. Groups don't have to be contiguous and it retains the original order. And it's just one line:
library(data.table)
# set up data
set.seed(2)
npupils <- rpois(10, 20)
pupil <- unlist(lapply(npupils, seq_len))
school <- rep(seq_along(npupils), npupils)
dt <- data.table(school = school, pupil = pupil) # Create a data.table
dt <- dt[sample(seq_len(nrow(dt)))] # Mix it up
dt
school pupil
1: 5 2
2: 6 13
3: 2 14
4: 5 3
5: 10 14
---
186: 3 11
187: 7 2
188: 8 12
189: 3 6
190: 7 10
(dt[, schoolSize := .N, by = school])
school pupil schoolSize
1: 5 2 16
2: 6 13 18
3: 2 14 15
4: 5 3 16
5: 10 14 24
---
186: 3 11 14
187: 7 2 28
188: 8 12 19
189: 3 6 14
190: 7 10 28
That has all the usual speed advantages of fast grouping, and assigns the new column by reference with no copy at all.
Edit: Deleted an answer that was only relevant for data.table prior to version 1.8.1: (Thanks to Matthew for the update).
You could try something like this:
dt <- data.frame(p=1:20, school=sample(1:5, 20, replace=T))
tbsch <- table(dt$school)
tbsch <- data.frame(tbsch)
merge(dt, tbsch, by.x="school", by.y="Var1")
You can also use plyr... and preserve the original order using this one-liner:
join(dt, count(dt, "school"))
