Duplicate rows in R N times and add a new count column

Hi: Here is the data I have, "have":
have <- data.frame(c(1, 2, 3),
                   c(90, 87, 71),
                   c(600, 601, 602))
colnames(have) <- c("STUDENT", "SCORE", "TYPE")
Here is the data I want, "want":
want <- data.frame(c(1, 1, 2, 2, 3, 3),
                   c(90, 90, 87, 87, 71, 71),
                   c(600, 600, 601, 601, 602, 602),
                   c(100, 101, 100, 101, 100, 101))
colnames(want) <- c("STUDENT", "SCORE", "TYPE", "CLASS")
As shown above, starting from the "have" data I want to duplicate the row for every STUDENT and add a new column "CLASS", which equals 100 for the STUDENT's first row and 101 for the STUDENT's second row.
Cheers!

I am creating an additional key column for the merge:
have$key <- 1
mergedf <- data.frame(CLASS = c(100, 101), key = 1)
merge(have, mergedf, all.x = TRUE)
key STUDENT SCORE TYPE CLASS
1 1 1 90 600 100
2 1 1 90 600 101
3 1 2 87 601 100
4 1 2 87 601 101
5 1 3 71 602 100
6 1 3 71 602 101
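If you don't want the helper key in the final output, a small cleanup sketch (the name out is my own):
out <- merge(have, mergedf, all.x = TRUE)
out$key <- NULL   # drop the helper key once the merge is done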

李哲源 and Axeman provided the answers:
## R core
data.frame(have[rep(1:nrow(have), each = 2), ], CLASS = c(100, 101),
           row.names = seq_len(2 * nrow(have)))
## dplyr
dplyr::bind_rows('100' = have, '101' = have, .id = 'CLASS')
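Note that .id fills CLASS from the list names, so it comes back as character ("100"/"101"); a small cleanup sketch if you need it numeric and grouped by student (the name out is my own):
out <- dplyr::bind_rows('100' = have, '101' = have, .id = 'CLASS')
out$CLASS <- as.numeric(out$CLASS)   # .id produces a character column
out <- out[order(out$STUDENT), ]     # regroup rows by student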

classes <- as.matrix(seq(100, 101, by = 1))
classes_rep <- matrix(classes, nrow = nrow(have) * nrow(classes))
want <- cbind(rbind(have, have), classes_rep)
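One caveat with this approach, assuming the original three-column "have": cbind() names the new column classes_rep and rbind(have, have) interleaves the students, so a final rename and sort is needed to match "want" exactly:
colnames(want)[ncol(want)] <- "CLASS"          # cbind used the variable name
want <- want[order(want$STUDENT, want$CLASS), ]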

Related

How can I calculate the inter-pair correlation of a variable according to id in the whole dataframe?

I have a twin dataset in which there is one column called wpsum; another column is family-id, which is the same for corresponding twin pairs.
        wpsum  family-id
twin 1     14        220
twin 2     18        220
I want to calculate the correlation between wpsum of those with the same family-id, while there are also some single family-ids, if one twin did not take part in the re-survey. family-id is a character.
There's no correlation between wpsum of those with the same family-id, as you put it, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment). But you can get the difference in wpsum scores within the groups; maybe that's what you meant by correlation. Here's how to get those (I changed and expanded your example):
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220", "220", "221", "221", "222", "222", "223"))
dat
wpsum family_id
1 14 220
2 18 220
3 20 221
4 5 221
5 10 222
6 NA 222
7 1 223
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA
You can make a data.frame with this new variable of differences like so:
diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
diffs family_id
1 4 220
2 15 221
3 NA 222
4 NA 223
Note that neither missing values nor missing observations are a (coding) problem here - they just result in missing differences without error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
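For that case, a minimal sketch using combn to get all pairwise absolute differences per family (assuming that is the generalization you'd want):
pair_diffs <- by(dat, dat$family_id, function(x) {
  if (nrow(x) < 2) return(NA_real_)              # singletons have no pair
  combn(x$wpsum, 2, function(p) abs(p[1] - p[2]))
})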

Calculated field based on Ranges in a second data frame in R

I have found similar posts regarding this task, but all of them have a common ID joining the two tables.
I have one data frame which contains sale records (sales_df). For this example I have simplified the data table so that it contains only 5 records. I would like to create a new column in sales_df that calculates what the fee would be for a given sale price, as defined in the fee table (pricing_fees). Please note that the number of actual pricing fee ranges I have to account for is around 30, so writing this into a mutate statement is something I would like to avoid.
The two data frames are coded as follows
sales_df <- data.frame(invoice_id = 1:5,
                       sale_price = c(100, 275, 350, 500, 675))
pricing_fees <- data.frame(min_range = c(0, 50, 100, 200, 300, 400, 500),    # >=
                           max_range = c(50, 100, 200, 300, 400, 500, 1000), # <
                           buyer_fee = c(1, 1, 25, 50, 75, 110, 125))
In the end I would like the resulting sales_df to look something like this.
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125
Thanks in advance
You can use the findInterval function, which is efficient for mapping values onto ranges (since it uses binary search):
# build consecutive increasing ranges of fees
# (in order to use findInterval, since it works on ranges defined in a single vector)
pricing_fees <- pricing_fees[order(pricing_fees$min_range), ]
consecFees <- data.frame(ranges = c(pricing_fees$min_range[1], pricing_fees$max_range),
                         fees = c(pricing_fees$buyer_fee, NA))
# consecFees now is :
#
# ranges fees
# 1 0 1 ---> it means for price in [0,50) -> 1
# 2 50 1 ---> it means for price in [50,100) -> 1
# 3 100 25 ---> it means for price in [100,200) -> 25
# 4 200 50 ... and so on
# 5 300 75
# 6 400 110
# 7 500 125
# 8 1000 NA ---> NA because for values >= 1000 we set NA
# add the column to sales_df using findInterval
sales_df$buyer_fee <- consecFees$fees[findInterval(sales_df$sale_price, consecFees$ranges)]
Result:
> sales_df
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125
You can also use cut to "bin" sales_df$sale_price values and label bins with corresponding buyer_fee values.
# Make pricing_fee table with unique buyer_fee
brks <- do.call(rbind, by(pricing_fees, pricing_fees$buyer_fee, FUN = function(x)
  data.frame(min_range = min(x$min_range), max_range = max(x$max_range),
             buyer_fee = unique(x$buyer_fee))))
sales_df$buyer_fee <- as.numeric(as.character(cut(
  sales_df$sale_price,
  breaks = c(0, brks$max_range),
  labels = brks$buyer_fee,
  right = FALSE)))
# invoice_id sale_price buyer_fee
#1 1 100 25
#2 2 275 50
#3 3 350 75
#4 4 500 125
#5 5 675 125
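For what it's worth, a non-equi join is another way to do this lookup; a minimal sketch starting from the original sales_df, assuming data.table >= 1.9.8 (which introduced non-equi joins):
library(data.table)
setDT(sales_df)
setDT(pricing_fees)
# for each sale, find the fee row whose [min_range, max_range) contains sale_price
sales_df[, buyer_fee := pricing_fees[sales_df,
                                     on = .(min_range <= sale_price,
                                            max_range > sale_price),
                                     x.buyer_fee]]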

Match column and rows then replace

I have to analyze data from an economic experiment.
My database is composed of 14,976 observations of 212 variables. Within this database we have other information like the profit, total profit, the treatments, and other variables.
You can see that I have two types :
Type 1 is for sellers
Type 2 is for buyers
For some variables, results were put in the buyers' (type 2) rows and not in the sellers' ones (a completely arbitrary choice). However, I would like to analyze the gender of sellers who overcharged (for instance), so I need to manipulate my database and I don't know how to do this.
Here, you have part of the database :
 ID  Gender  Period  Matching group  Group  Type  Overcharging  ...
654       1       1              73      1     1            NA
654       1       2              73      1     1            NA
654       1       3              73      1     1            NA
654       1       4              73      1     1            NA
435       1       1              73      2     1            NA
435       1       2              73      2     1            NA
435       1       3              73      2     1            NA
435       1       4              73      2     1            NA
708       0       1              73      1     2             1
708       0       2              73      1     2             0
708       0       3              73      1     2             0
708       0       4              73      1     2             1
546       1       1              73      2     2             0
546       1       2              73      2     2             0
546       1       3              73      2     2             1
546       1       4              73      2     2             0
To do what I'd like, I have plenty of information: only one seller was matched with one buyer at period x, in group x, matching group x, and with treatment x...
To give you an example: in matching group 73 we know that at period 1 subject 708 was overcharged (the one in group 1). As I know that this man belongs to group 1 and matching group 73, I am able to identify the seller who overcharged him at period 1: subject 654, with gender = 1.
So, I would like to put the overcharging (and some other) buyer values on the seller rows (Type == 1) to analyze seller behavior, at the right period, for the right group and the right matching group.
I have a long way of doing it with data.frames. If you are looking to code in R long term, I would suggest checking out either (i) the dplyr package, part of the tidyverse suite, or (ii) the data.table package. The first has the most popular syntax and ties in nicely with a bunch of useful packages. The second is harder to learn but quicker. For data of your size, the difference is negligible though.
In base data.frames, here is something I hope matches your request. Let me know if I've mistaken anything, or been unclear.
# sellers data eg
dt1 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 1,
                  Overcharging = NA)
# buyers data eg
dt2 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 2,
                  Overcharging = c(1, 0, 0, 1))
# make my current data view
dt <- rbind(dt1, dt2)
dt[]
# split in to two data frames, on the Type column:
dt_split <- split(dt, dt$Type)
dt_split
# move out of list
dt_suffix <- seq_along(dt_split)
dt_names <- sprintf("dt%s", dt_suffix)
for(name in dt_names){
assign(name, dt_split[match(name, dt_names)][[1]])
}
dt1[]
dt2[]
# define the columns in which to match up the buyer to seller
merge_cols <- c("Period", "MatchGroup", "Group")
# define the columns you want to merge, that you know are NA
na_cols <- c("Overcharging")
# now use merge operation, and filter dt2, to pull in only columns you want
# I suggest dropping the na_cols first in dt1, as otherwise it will create two
# columns post-merge: Overcharging, i.Overcharging
dt1 <- dt1[,setdiff(names(dt1), na_cols)]
dt1_new <- merge(dt1,
dt2[, c(merge_cols, na_cols)], # filter dt2
by = merge_cols, # columns to match on
all.x = TRUE) # dt1 is x, dt2 is y. Want to keep all of dt1
# if you want to bind them back together, ensure the column order matches, and
# bind e.g.
dt1_new <- dt1_new[, names(dt2)]
dt_final <- rbind(dt1_new, dt2)
dt_final[]
My line of thinking is to split the buyers and sellers into two separate data frames, then identify how they join and migrate the data you need from buyers to sellers, and finally bring them back together if so desired.
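Since dplyr came up above, here is a minimal sketch of the same merge in that syntax, assuming the dt built earlier (sellers, buyers and sellers_new are my own names):
library(dplyr)
sellers <- dt %>% filter(Type == 1) %>% select(-Overcharging)
buyers  <- dt %>% filter(Type == 2)
# pull each buyer's Overcharging value onto the matching seller row
sellers_new <- sellers %>%
  left_join(select(buyers, Period, MatchGroup, Group, Overcharging),
            by = c("Period", "MatchGroup", "Group"))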

How to subtract a value in one column from a previous row's value in a different column in R

I have a dataframe composed of 3 columns and ~2000 rows.
ID DistA DistB
1 100 200
2 239 390
3 392 550
4 700 760
5 770 900
The first column (ID) is a unique identifier for each row. I'd like my script to read each row and subtract the value in column "DistA" from the value in column "DistB" of the previous row. If the difference for any such pair of rows is < 40, output that they are in the same area.
For example, comparing rows 2 and 1: '239' from row 2 minus '200' from row 1 is 39, which is < 40, so they are in the same area. The same goes for rows 3 and 2: the difference is 2, and 2 < 40. But rows 4 and 3 are not, as the difference is 150.
I have not been able to get far, as I am stuck on the comparison (subtraction/difference) step. I have tried to write a loop to iterate over all the rows, but I keep getting errors. Should I even use a loop, or can I do this without one?
I am a new R learner, and this is the 'rookie' code that I have so far. Where am I going wrong? Thanks in advance:
#the function to compare the two columns
funct <- function(x){
for(i in 1:(nrow(dat)))
(as.numeric(dat$DistA[i-1])) - (as.numeric(dat$DistB[i]))}
#creating a new column 'new2' with the differences
dat$new2 <- apply(dat[,c('DistB','DistA')]),1, funct
When I run this, I get the following error:
Error: unexpected ',' in "dat$new2 <- apply(dat[,c('DistB','DistA')]),"
I'll appreciate all the comments/suggestions.
I believe dplyr can help you here.
library(dplyr)
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
                     DistA = c(100, 239, 392, 700, 770),
                     DistB = c(200, 390, 550, 760, 900))
dfData <- mutate(dfData, comparison = DistA - lag(DistB))
This results in...
dfData
ID DistA DistB comparison
1 1 100 200 NA
2 2 239 390 39
3 3 392 550 2
4 4 700 760 150
5 5 770 900 10
You could then check whether a row is within the same "area" as the previous row.
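A minimal follow-up sketch (the column name same_area is my own; the NA guard covers the first row, which has no previous DistB):
dfData <- mutate(dfData, same_area = !is.na(comparison) & comparison < 40)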
We could also try data.table (similar to the approach suggested in the comments by @David Arenburg). shift is a new function introduced in the devel version, with type = 'lag' as the default option. It can be installed from here.
library(data.table)#data.table_1.9.5
setDT(df1)[, Categ := c('Diff', 'Same')[
  (abs(DistA - shift(DistB)) < 40) + 1L]][]
# ID DistA DistB Categ
#1: 1 100 200 NA
#2: 2 239 390 Same
#3: 3 392 550 Same
#4: 4 700 760 Diff
#5: 5 770 900 Same
If we need both the 'difference' and 'category' columns
setDT(df1)[, c('Dist', 'Categ') := {tmp <- abs(DistA - shift(DistB))
  list(tmp, c('Diff', 'Same')[(tmp < 40) + 1L])}]
df1
# ID DistA DistB Dist Categ
#1: 1 100 200 NA NA
#2: 2 239 390 39 Same
#3: 3 392 550 2 Same
#4: 4 700 760 150 Diff
#5: 5 770 900 10 Same
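For reference, the same lagged difference can also be computed in base R with no packages; a minimal sketch:
df1$Dist  <- c(NA, df1$DistA[-1] - df1$DistB[-nrow(df1)])  # DistA minus previous DistB
df1$Categ <- ifelse(abs(df1$Dist) < 40, "Same", "Diff")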

Is there a way to stop table() from sorting in R?

Problem setup: I am creating a function that takes multiple CSV files selected by an ID column, combines them into one CSV, then outputs the number of observations per ID.
Expected:
complete("specdata", 30:25) ##notice descending order of IDs requested
## id nobs
## 1 30 932
## 2 29 711
## 3 28 475
## 4 27 338
## 5 26 586
## 6 25 463
I get:
> complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
Which is "wrong" because it has been sorted by id.
The CSV file I read from does have the data in descending order. My snippet:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
xTab3<-as.data.frame(table(dfTable$ID[ccTab]),)
colnames(xTab3)<-c("id","nobs")
As near as I can tell, the third line is where the sorting occurs. I broke out the expression, and it happens in the table() call. I've not found any option or parameter I can pass to get something like sort = FALSE. You'd think...
Anyway. Any help appreciated!
So, the problem is in the output of table, which is sorted by default. For example:
> r = sample(5,15,replace = T)
> r
[1] 1 4 1 1 3 5 3 2 1 4 2 4 2 4 4
> table(r)
r
1 2 3 4 5
4 3 2 5 1
If you want to take the order of first appearance, you are going to get your hands a little bit dirty by recoding the table function:
unique_r <- unique(r)
table_r <- rbind(label = unique_r,
                 count = sapply(unique_r, function(x) sum(r == x)))
table_r
[,1] [,2] [,3] [,4] [,5]
label 1 4 3 5 2
count 4 5 2 1 3
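If you then want a data.frame shaped like the asker's xTab3, a small sketch building on table_r above:
xTab3 <- data.frame(id = table_r["label", ], nobs = table_r["count", ])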
One way to get around this is...don't use table. Here's an example where I create three one-line data sets from your data, then read them back in a descending sequence with read.table, and it seems to be okay.
The big thing here is that multiple data sets should be placed in a list upon being read into R. You'll get the exact order of data sets you want that way, among other benefits.
Once you've read them into R the way you want them, it's much easier to order them at the very end. Ordering of rows (for me) is usually the very last step.
> dat <- read.table(h=T, text = "id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932")
Write three one-line files:
> write.table(dat[3,], "dat3.csv", row.names = FALSE)
> write.table(dat[2,], "dat2.csv", row.names = FALSE)
> write.table(dat[1,], "dat1.csv", row.names = FALSE)
Read them in using a 3:1 order:
> do.call(rbind, lapply(3:1, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 27 338
# 2 26 586
# 3 25 463
Then, if we change 3:1 to 1:3, the rows "comply" with our request:
> do.call(rbind, lapply(1:3, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 25 463
# 2 26 586
# 3 27 338
And just for fun
> fun <- function(z){
do.call(rbind, lapply(z, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE) }))
}
> fun(c(2, 3, 1))
# id nobs
# 1 26 586
# 2 27 338
# 3 25 463
You may try something like this:
t1 <- c(5,3,1,3,5,5,5)
as.data.frame(table(t1)) ##result in ascending order
# t1 Freq
#1 1 1
#2 3 2
#3 5 4
t1 <- factor(t1)
as.data.frame(table(reorder(t1, rep(-1, length(t1)),sum)))
# Var1 Freq
#1 5 4
#2 3 2
#3 1 1
In your case you are complaining about the table function with a single argument returning the items with their names in ascending order, when you want them in descending order. You could simply have used the rev() function around the table call.
xTab3<-as.data.frame( rev( table( dfTable$ID[ccTab] ) ),)
(I'm not sure what that last comma is doing in there.) The sort order in the original would not be expected to determine the order of a table operation. Generally R will return results with discrete labels sorted in alpha (ascending) order, unless the levels of a factor item have been specified differently. That's one of those R-specific rules that may be difficult to intuit. The other R-specific rule that may be difficult to grasp (although not really a problem here) is that arguments are often expected to be in the form of R lists.
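Building on that note about factor levels, a minimal sketch for the asker's snippet: fixing the levels to first-appearance order makes table() keep that order.
ids <- dfTable$ID[ccTab]
xTab3 <- as.data.frame(table(factor(ids, levels = unique(ids))))
colnames(xTab3) <- c("id", "nobs")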
It's probably wise to think about R table objects at this point (and what happens with the as.data.frame call). table objects are actually R arrays, so the feature that you wanted to sort by was actually the rownames of that table object, which are of class character:
r = sample(5,15,replace = T)
table(r)
#r
#2 3 4 5
#5 3 2 5
rownames(table(r))
#[1] "2" "3" "4" "5"
str(as.data.frame(table(r)))
#-------
'data.frame': 4 obs. of 2 variables:
$ r : Factor w/ 4 levels "2","3","4","5": 1 2 3 4
$ Freq: int 5 3 2 5
I just want to share this homework I've done:
complete <- function(directory, id = 1:332) {
  setwd("E:/Coursera")
  files <- dir(directory, full.names = TRUE)
  data <- lapply(files, read.csv)
  specdata <- do.call(rbind, data)
  cleandata <- specdata[!is.na(specdata$sulfate) & !is.na(specdata$nitrate), ]
  result <- data.frame(id = numeric(0), nobs = numeric(0))
  for (i in id) {
    targetdata <- cleandata[cleandata$ID == i, ]
    result <- rbind(result, data.frame(table(targetdata$ID)))
  }
  names(result) <- c("id", "nobs")
  result
}
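Assuming the Coursera specdata layout this function targets, the ids come back in the requested order because the loop appends one id at a time:
complete("specdata", 30:25)   # rows come back for ids 30, 29, ..., 25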
A simple solution that no one has proposed yet is combining table() with the unique() function. unique() gives the behaviour you are looking for (listing unique IDs in order of appearance).
In your case it would be something like this:
dfTable <- read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab <- complete.cases(dfTable)
x <- dfTable$ID[ccTab] # unique IDs
# index table() by name (as.character), not by position, to put the
# result in order of first appearance
xTab3 <- as.data.frame(table(x)[as.character(unique(x))])
colnames(xTab3) <- c("id", "nobs")
