Unable to set column names to a subset of a dataframe - r

I run the following code; p is the dataframe that was loaded.
a <- sort(table(p$Title))
a1 <- as.data.frame(a)
tail(a1, 7)
                   a
Maths            732
Science          737
Physics          737
Chemistry        776
Social Science   905
null           57374
               88117
I want to do some manipulation on the result above, so I tried to add column names to the dataframe using the colnames function.
colnames(a1) <- c("category", "count")
I get the below error:
Error in `colnames<-`(`*tmp*`, value = c("category", "count")) :
attempt to set 'colnames' on an object with less than two dimensions
Please suggest.

As I said in the comments to your question, the categories are rownames. A reproducible example:
# create dataframe p
x <- c("Maths","Science","Physics","Chemistry","Social Science","Languages","Economics","History")
set.seed(1)
p <- data.frame(title=sample(x, 100, replace=TRUE), y="some arbitrary value")
# create the data.frame as you did
a <- sort(table(p$title))
a1 <- as.data.frame(a)
The resulting dataframe:
> a1
                a
Social Science  6
Maths           9
History        10
Science        11
Physics        12
Languages      15
Economics      17
Chemistry      20
Looking at the dimensions of dataframe a1, you get this:
> dim(a1)
[1] 8 1
which means that your dataframe has 8 rows and 1 column. Trying to assign two column names to the a1 dataframe will therefore result in an error.
You can solve your problem in two ways:
1: assign just one column name with colnames(a1) <- c("count")
2: convert the rownames to a category column and then assign both column names:
a1$category <- row.names(a1)
colnames(a1) <- c("count","category")
The resulting dataframe:
> a1
               count       category
Social Science     6 Social Science
Maths              9          Maths
History           10        History
Science           11        Science
Physics           12        Physics
Languages         15      Languages
Economics         17      Economics
Chemistry         20      Chemistry
You can remove the rownames with rownames(a1) <- NULL. This gives:
> a1
  count       category
1     6 Social Science
2     9          Maths
3    10        History
4    11        Science
5    12        Physics
6    15      Languages
7    17      Economics
8    20      Chemistry
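For completeness, here is a compact sketch stringing the steps of option 2 together (the final reordering is optional and just puts category first):
a1 <- as.data.frame(a)
a1$category <- row.names(a1)
colnames(a1) <- c("count", "category")
rownames(a1) <- NULL
a1 <- a1[, c("category", "count")]   # optional: reorder so category comes first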

Related

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording bilateral trade data between 161 countries. The data are in dyadic format, containing 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any given pair (rid, pid) with rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
   rid pid TradeValue
1    2   3        500
2    2   7       2328
3    2   8    2233465
4    2   9      81470
5    2  12     572893
6    2  17     488374
7    2  19    3314932
8    2  23      20323
9    2  25         10
10   2  29    9026220
The data were sourced from the UN Comtrade database; each rid is paired with multiple pid to get their bilateral trade data. As you can see, not every partner has a numeric pid, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country is available. That is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when such a country becomes a "reporter": in that situation the country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create a row for each of them and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose that in a 5 x 5 case country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with every country except country 1. The original dataframe looks like this:
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
which I want to convert to a 5 x 5 adjacency matrix (in data.frame format); the desired output should look like this:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
And I want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not work out such a conversion, not even the first step.
It would be really appreciated if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work off your 5-country example dataframe:
library(reshape2)   # for dcast()
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = if (length(setdiff(1:country_num, example_data$pid)) == 0) 1 else
  setdiff(1:country_num, example_data$pid)
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
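If reshape2 is not available, here is a minimal base-R sketch along the same lines; it assumes rows with NA ids or values are dropped first and that missing pairs should end up as 0:
d <- na.omit(example_data)                 # drop rows with NA ids or values
n <- 161                                   # use 5 for the toy example above
adj <- matrix(0, nrow = n, ncol = n)
adj[cbind(d$rid, d$pid)] <- d$TradeValue   # fill the reported cells
adj[cbind(d$pid, d$rid)] <- d$TradeValue   # mirror, since TradeValue(rid, pid) = TradeValue(pid, rid)
adj_df <- as.data.frame(adj)               # n x n data.frame, zeros where no trade was reported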

Dummy variables based on values from different columns

I currently have a data frame that looks like this:
dat2<-data.frame(
ID=c(100,101,102,103),
DEGREE_1=c("BA","BA","BA","BA"),
DEGREE_2=c(NA,"BA",NA,NA),
DEGREE_3=c(NA,"MS",NA,NA),
YEAR_DEGREE_1=c(1980,1990,2000,2004),
YEAR_DEGREE_2=c(NA,1992,NA,NA),
YEAR_DEGREE_3=c(NA,1996,NA,NA)
)
 ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_DEGREE_1 YEAR_DEGREE_2 YEAR_DEGREE_3
100       BA     <NA>     <NA>          1980            NA            NA
101       BA       BA       MS          1990          1992          1996
102       BA     <NA>     <NA>          2000            NA            NA
103       BA     <NA>     <NA>          2004            NA            NA
I would like to create dummy variables coded 0/1 based on what kind of degree was earned, using the completion of one BA degree as the base.
The completed data frame would have a second-BA dummy, an MS dummy, and so on. For example, for ID 101 both dummies would have a value of 1. The completion of two MS degrees would not require its own dummy: if someone completed two MS degrees, the MS dummy would simply be 1.
This is a simple snapshot of a much bigger data frame that has many different degree types besides BA and MS, so it isn't ideal for me to create if/else statements for every single degree type.
Any advice would be appreciated.
You could also include new columns and assign the value based on the DEGREE columns.
Including new columns, with all values equal to 0:
dat2 <- cbind(dat2, BA_2nd = 0)
dat2 <- cbind(dat2, MS = 0)
Changing the value to 1, based on your conditions:
dat2[!is.na(dat2$DEGREE_2), "BA_2nd"] <- 1
dat2[!is.na(dat2$DEGREE_3) & dat2$DEGREE_3 == "MS", "MS"] <- 1
dat2
You can adapt it to all the conditions you have. This code generates only the output table that you included.
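If the real data have many more degree types, a rough, more general sketch along these lines might help; it only assumes that the degree columns are named DEGREE_* and that a BA column shows up in the counts (names such as BA_2nd are taken from the example above):
deg_cols <- grep("^DEGREE_", names(dat2), value = TRUE)
long <- data.frame(ID     = rep(dat2$ID, times = length(deg_cols)),
                   degree = as.character(unlist(dat2[deg_cols], use.names = FALSE)),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$degree), ]
counts <- as.data.frame.matrix(table(long$ID, long$degree))   # degree counts per ID
counts$BA_2nd <- as.integer(counts$BA >= 2)                   # one BA is the base, so flag a second BA
counts$BA <- NULL
dummies <- as.data.frame(lapply(counts, function(x) as.integer(x >= 1)))
dummies$ID <- as.numeric(rownames(counts))
merge(dat2, dummies, by = "ID")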

Random selection based on a variable in a R dataframe [duplicate]

This question already has answers here:
Take the subsets of a data.frame with the same feature and select a single row from each subset
(3 answers)
Closed 7 years ago.
I have a data frame with 1000 columns. It is a dataset of animals from different breeds; however, I have more animals from some breeds than from others. What I want to do is select a random sample from the breeds with more animals so that all breeds have the same number of observations.
In detail: I have 400 Holstein animals, 300 Jersey, 100 Hereford, 150 Nelore and 50 Canchim. What I want to do is randomly select 50 animals from each breed, so I would have a total of 250 animals at the end. I know how to randomly select using runif; however, I am not sure how to apply that in my case.
My data looks like:
Breed ID Trait1 Trait2 Trait3
Holstein 1 11 22 44
Jersey 2 22 33 55
Nelore 3 33 44 66
Nelore 4 44 55 77
Canchim 5 55 66 88
I have tried:
Data = data[!!ave(seq_along(data$Breed), unique(data$Breed), FUN=function(x) sample(x, 50) == x),]
However, it does not work, and I am not allowed to install the dplyr package on the server that I am using.
Thanks in advance.
You can split your animals data frame on the breed, and then apply a custom function to each chunk which will randomly extract 50 rows:
animals.split <- split(animals, animals$Breed)
animals.list <- lapply(animals.split, function(x) {
  y <- x[sample(nrow(x), 50), ]
  return(y)
})
# bind the per-breed samples back together; unsplit() would complain here because
# the sampled chunks are shorter than the original groups
result <- do.call(rbind, animals.list)
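As a quick sanity check, here is a small sketch with made-up data matching the breed counts from the question (the Breed and ID column names follow the example data above):
set.seed(42)
animals <- data.frame(Breed = rep(c("Holstein", "Jersey", "Hereford", "Nelore", "Canchim"),
                                  times = c(400, 300, 100, 150, 50)),
                      ID = 1:1000)
animals.list <- lapply(split(animals, animals$Breed), function(x) x[sample(nrow(x), 50), ])
result <- do.call(rbind, animals.list)
table(result$Breed)   # every breed should now show exactly 50 rows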

R "melt-cast" like operation

I have a file that contains content like this:
name: erik
age: 7
score: 10
name: stan
age:8
score: 11
name: kyle
age: 9
score: 20
...
As you can see, each record actually spans 3 rows in the file. I am wondering how I can read in the file and transform it into a dataframe that looks like the one below:
name age score
erik 7 10
stan 8 11
kyle 9 20
...
What I have done so far (thanks tcash21):
> data <- read.table(file.choose(), header=FALSE, sep=":", col.names=c("variable", "value"))
> data
variable value
1 name erik
2 age 7
3 score 10
4 name stan
5 age 8
6 score 11
7 name kyle
8 age 9
9 score 20
I am thinking about how I can split the column into two columns by ":" and then maybe use something like cast in the reshape package to do what I want.
Alternatively, how can I get only the rows with index numbers 1, 4, 7, ..., which increase with a constant step?
Thanks!
Another possibility:
library(reshape2)
df <- data                               # the two-column data frame read in above
df$id <- rep(1:(nrow(df)/3), each = 3)   # tag each 3-row record with an id
dcast(df, id ~ variable, value.var = "value")
# id age name score
# 1 1 7 erik 10
# 2 2 8 stan 11
# 3 3 9 kyle 20
If the format is predictable you might want to do something really simple like
# recreate the data
data <- matrix(c("erik", 7, 10, "stan", 8, 11, "kyle", 9, 20), ncol = 1)
# get the individual variables
names <- data[seq(1, length(data) - 2, 3)]
age   <- data[seq(2, length(data) - 1, 3)]
score <- data[seq(3, length(data), 3)]
# combine the variables
reformatted.data <- as.data.frame(cbind(names, age, score))
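Another compact option along the same lines, assuming each record really is exactly three consecutive rows and that the value column was read in as text (column names are taken from the file itself):
wide <- as.data.frame(matrix(trimws(as.character(data$value)), ncol = 3, byrow = TRUE),
                      stringsAsFactors = FALSE)
colnames(wide) <- trimws(as.character(data$variable[1:3]))   # "name", "age", "score"
wide$age   <- as.numeric(wide$age)
wide$score <- as.numeric(wide$score)
wide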

recoding using R

I have a data set with dam, sire, plus other variables, but I need to recode my dam and sire ids. The dam column is sorted and each animal appears only once. On the other hand, the sire column is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000, such that the first animal gets 50001, the second animal 50002 and so on. I have this script that numbers each dam from 1 to N, and I am wondering if it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
*EDITED
My data set is similar to this, but with more variables:
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz
At the moment, those sire and dam columns are factor variables, but in this case that means you can just add the as.numeric() results to your base number:
> mydf$dam_n <- 50000 +as.numeric(mydf$dam)
> mydf$sire_n <- 10000 +as.numeric(mydf$sire)
> mydf
    dam  sire  wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001  10005
2 1M584 1T456 300 16 50002  10001
3 1M790 1W865 332 18 50003  10004
4 1M871 1W209 351 NA 50004  10002
5 1M888 1W209 303 NA 50005  10002
6 1M933 1W648 314 15 50006  10003
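One caveat: in R 4.0.0 and later, data.frame() no longer converts character columns to factors by default, so on a current R version you would wrap the columns in factor() yourself; a minimal sketch:
mydf$dam_n  <- 50000 + as.numeric(factor(mydf$dam))
mydf$sire_n <- 10000 + as.numeric(factor(mydf$sire))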
Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure if I understood your data structures completely, but usually the names function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
Like:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8
