Taking cube root and log transformation in R

I have a table with row names corresponding to a set of people and their corresponding body mass estimates. For instance, say a matrix "mass estimate" with these values:
Name Mass
1 person_a 234
2 person_b 190
3 person_c 203
4 person_d 176
How can I, in a single line of R code, take the cube roots of the masses and then log-transform them?
I am not sure how to present the data above in table format, since the final question shows it on a single line. The first column reads "Name" and the second column reads "Mass"; each row has a name (person_a) and the mass (234).
Thanks!

# Sample matrix
mat <- matrix(runif(20), ncol = 5);
# log10-transform the cube root of all entries
mat.trans <- log10(mat^(1/3))
Or with your dataframe example (which is not the same as a matrix):
df <- read.table(text =
"Name Mass
1 person_a 234
2 person_b 190
3 person_c 203
4 person_d 176", sep = "");
# log10-transform the cube root
df$transMass <- log10(df$Mass^(1/3));
# Name Mass transMass
#1 person_a 234 0.7897386
#2 person_b 190 0.7595845
#3 person_c 203 0.7691653
#4 person_d 176 0.7485042

Assuming you have a dataframe df and a variable named Mass, you can use this:
df$New<-log10(df$Mass^(1/3))
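As a side note, taking the log of a cube root is the same as dividing the log by 3 (log10(x^(1/3)) equals log10(x)/3), so either form works; a quick check:
# equivalent ways to log10-transform a cube root
all.equal(log10(df$Mass^(1/3)), log10(df$Mass)/3)
# [1] TRUE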

Related

Copy a subset of a column, based on conditions, to another dataframe in R

I have very limited R skills, and after hours of searching for a solution I could not find an option that would work.
I have several large data tables. From each one, I would like to copy part of a column into a dataframe, to populate a column there.
My data tables (tabn1, tabn2, tabn3) all have the same format but different lengths, and each subset will have a different number of rows. I would want empty spaces to be filled with NA. I can't even copy the first column, so the subsequent ones are the next problem!
Ro Co Red Green Yellow
1 3 123 999 265
1 3 223 875 5877
1 4 21488 555 478
1 4 558 23698 5558
2 3 558 559 148
2 3 4579 557 59
2 4 1489 545 2369
2 4 123 999 265
3 3 558 559 148
3 3 558 23698 5558
3 4 4579 557 59
3 4 1478 4579 557
4 3 1488 555 478
4 3 1478 2945 5889
4 4 448 259 4548
4 4 26576 158 15
My new data frame col names:
cls <- c("n1","n2","n3")
I created a dataframe with the column names:
df <- setNames(data.frame(matrix(ncol=3)),cls)
For each of my tables, I want to subset Ro >= 3, Co == 3, column "Red" only
I have tried:
sub1 <- filter(tabn1, tabn1$Ro >= 3 | tabn1$Co == 3)
df$n1 <- sub1$Red
> Error in `$<-.data.frame`(`*tmp*`, n1, value = c(183.94, 180.884, :
replacement has 32292 rows, data has 1
Also:
df$n1 <- cut(sub1$Red)
> Error in cut.default(sub1$Red) :
argument "breaks" is missing, with no default
I tried using df as a datatable instead of dataframe, but also got the following errors:
df <- setNames(data.table(matrix(ncol=3)),cls)
df$n1 <- sub1$Red
> Error in set(x, j = name, value = value) :
Supplied 32292 items to be assigned to 1 items of column 'n1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
I would subsequently tried to subset and copy from tabn2 to df$n2, and so forth. As indicated above, the original tables have different lengths.
Thanks in advance!
The issue is that the numbers of rows in 'df' and 'sub1' are different: 'df' is created with 1 row. Instead, we can create 'df' directly from 'sub1' itself
df <- sub1['Red']
names(df) <- cls[1]
Also, another way to create the data.frame would be to specify the nrow as well
df <- as.data.frame(matrix(nrow = nrow(sub1), ncol = length(cls),
                           dimnames = list(NULL, cls)))
Regarding the second error with cut, it needs breaks. Either we specify the number of breaks
cut(sub1$Red, breaks = 3)
Or a vector of break points
cut(sub1$Red, breaks = c(-Inf, 100, 500, 1000, Inf))
If there are many 'tabn' objects, get them into a list, loop over the list with lapply
lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >=3 | Co == 3)$Red)
It is possible that, after subsetting and selecting the 'Red' column, the numbers of elements will differ. If the lengths are different, an option is to pad NA at the end of those having fewer elements before cbinding them
mx <- max(lengths(out_lst))
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))
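A minimal self-contained sketch of that approach, using two made-up tables (tabn1, tabn2) of different sizes rather than the asker's real data:
tabn1 <- data.frame(Ro = c(3, 3, 1), Co = c(3, 4, 3), Red = c(183, 555, 42))
tabn2 <- data.frame(Ro = c(4, 2), Co = c(3, 3), Red = c(99, 7))
lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >= 3 | Co == 3)$Red)
mx <- max(lengths(out_lst))
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))
df
#      tabn1 tabn2
# [1,]   183    99
# [2,]   555     7
# [3,]    42    NA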

Grouping_by and percentage in calculating in dplyr?

I need to have a table as a result which contains data about persons and their percentage. Here is an example of what it could look like.
person percentage
aa 17.04
bb 89.03
cc 67.99
dd 38.88
My dataset looks like this:
person points
aa 134
bb 234
cc 121
dd 134
My code is:
df$points/187*100
But as a result I get only the percentages, not the person data. What is a better way to solve this?
In order to write the result of the calculation to the data frame, one would use the assignment operator <- to assign the result to a column in the data frame. There's no need to load the dplyr package for this because the Base R solution requires less code than the dplyr solution.
We can do this as noted in the comments, where we use the $ form of the extract operator to assign the result to a named element in the data.frame() called df.
df$pct <- df$points/187*100
We can use the $ form of the extract operator on the left side of the assignment operator because a data frame is also a list().
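Equivalently, because a data frame is a list, the [[ form of the extract operator also works on the left-hand side (a small illustration, not part of the original answer):
df[["pct"]] <- df$points / 187 * 100  # same effect as df$pct <- ...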
Given the code in the question, some percentages will be greater than 100 because at least one row has a value greater than 187. Therefore, a more accurate percentage based on total points (column adds to 100) would be:
textFile <- "person points
aa 134
bb 234
cc 121
dd 134"
df <- read.table(text=textFile,header=TRUE)
df$pct <- df$points/sum(df$points)*100
df
sum(df$pct)
...and the output:
person points pct
1 aa 134 21.50883
2 bb 234 37.56019
3 cc 121 19.42215
4 dd 134 21.50883
> sum(df$pct)
[1] 100
If 187 is the total number of persons in the data frame, one might interpret the original 'percentage' as a rate (i.e. points per 100 persons).
df$points_rate <- df$points/187*100
df
...and the output:
> df$points_rate <- df$points/187*100
> df
person points pct points_rate
1 aa 134 21.50883 71.65775
2 bb 234 37.56019 125.13369
3 cc 121 19.42215 64.70588
4 dd 134 21.50883 71.65775
If one must use dplyr for the solution, the solution with the least code avoids the magrittr pipe operator %>% by simply naming the data frame in the mutate() function.
library(dplyr)
df <- mutate(df, pct = points / sum(points) * 100,
             points_rate = points / 187 * 100)
df
...and the output:
> df
person points pct points_rate
1 aa 134 21.50883 71.65775
2 bb 234 37.56019 125.13369
3 cc 121 19.42215 64.70588
4 dd 134 21.50883 71.65775
Finally, although the title of the question mentions grouping, neither the data nor code provided with the question require grouping functions to produce the output.
Assuming that you want to calculate the percentage points of the point score of each person where 187 points equal 100% (in contrast to #Len Greski's answer):
library(dplyr)
df <- df %>% mutate(percentage = points/1.87)
with output:
person points percentage
1 aa 134 71.65775
2 bb 234 125.13369
3 cc 121 64.70588
4 dd 134 71.65775
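If the data did include a grouping column, within-group percentages could be computed with group_by() before mutate(); a sketch with a hypothetical 'group' column, not part of either answer above:
library(dplyr)
df$group <- c("g1", "g1", "g2", "g2")  # hypothetical grouping column
df %>%
  group_by(group) %>%
  mutate(pct_within_group = points / sum(points) * 100) %>%
  ungroup()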

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 rows; a sample of this dataset follows below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
And I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to select the other 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But it didn't work, and this message appeared:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As #divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors. The interpretation of a factor depends on both the codes and the 'levels' attribute. Be careful to compare only factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (convert them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
And after I ran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
And now it worked as I expected. Credits to #divibisan
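For reference, the conversion and the NA-safe comparison can also be done in one pipeline; a sketch assuming the same my.file with factor columns gend.y and gend.x (not code from the original thread):
library(dplyr)
new.file <- my.file %>%
  mutate(gend.y = as.character(gend.y),
         gend.x = as.character(gend.x)) %>%
  filter(gend.y != gend.x | is.na(gend.y) | is.na(gend.x))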

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries. The data are in dyadic format, containing 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair (rid, pid) in which rid != pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database; each rid is paired with multiple pid to get their bilateral trade data. As can be seen, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country is available, which is why there are NA in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country becomes a "reporter": in that situation, the country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence, the rid column begins with 2, because country 1 (i.e., Afghanistan) did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, the other four reported their bilateral trade statistics with other (except country 1). The original dataframe is like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
from which I want to convert it to a 5 x 5 adjacency matrix (of data.frame format), the desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
And I want to use the same method on the example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not get this conversion to work, not even the first step.
It would be really appreciated if anyone could enlighten me on this.
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = ifelse(length(setdiff(1:country_num, example_data$pid)) == 0,
1, setdiff(1:country_num, example_data$pid))
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast (from the reshape2 package) now takes missing rid and pid into account
library(reshape2)
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
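For what it is worth, a base R alternative sketch that skips dcast entirely by filling a pre-allocated matrix with (row, column) index pairs; it assumes example_data has the rid, pid, and TradeValue columns described above and is not part of the original answer:
n <- 161
adj <- matrix(0, n, n)
# drop rows whose partner id is NA, then fill by (rid, pid) index pairs
ok <- !is.na(example_data$pid)
adj[cbind(example_data$rid[ok], example_data$pid[ok])] <- example_data$TradeValue[ok]
# symmetrize, since TradeValue(rid, pid) = TradeValue(pid, rid)
adj <- pmax(adj, t(adj))
adj_df <- as.data.frame(adj)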

replacing for loops in a function with vector calculations to speed up R

Say I have some data in data frame d1 that describes how frequently different sample individuals eat different foods, with a final column describing whether or not those foods are cool to eat. The data are structured like this.
OTU.ID<- c('pizza','taco','pizza.taco','dirt')
s1<-c(5,20,14,70)
s2<-c(99,2,29,5)
s3<-c(44,44,33,22)
cool<-c(1,1,1,0)
d1<-data.frame(OTU.ID,s1,s2,s3,cool)
print(d1)
OTU.ID s1 s2 s3 cool
1 pizza 5 99 44 1
2 taco 20 2 44 1
3 pizza.taco 14 29 33 1
4 dirt 70 5 22 0
I have written a function that computes, for each sample s1:s3, the number of cool foods that were consumed and the total number of foods that were consumed. It runs a for loop over each sample column of the data table (which is extremely slow).
cool.food.abundance <- function(food.table){
  samps <- colnames(food.table)
  # remove column names that are not sample names
  samps <- samps[!samps %in% c("OTU.ID", "cool")]
  # create output vectors for the for loop
  id <- c()
  cool.foods <- c()
  all.foods <- c()
  # run a loop that stores output ids and results as vectors
  for(i in 1:length(samps)){
    x <- samps[i]
    y1 <- sum(food.table[samps[i]] * food.table$cool)
    y2 <- sum(food.table[samps[i]])
    id <- c(id, x)
    cool.foods <- c(cool.foods, y1)
    all.foods <- c(all.foods, y2)
  }
  # save results as a data frame and return the data frame object
  results <- data.frame(id, cool.foods, all.foods)
  return(results)
}
So, if you run this function, you will get a new table of sample IDs, the number of cool foods that sample ate, and the total number of foods that sample ate.
cool.food.abundance(d1)
id cool.foods all.foods
1 s1 39 109
2 s2 130 135
3 s3 121 143
How can I replace this for-loop with vector calculations to speed it up? I would really like the function to be able to operate on data frames loaded with the fread function in the data.table package.
You can try
library(data.table)#v1.9.5+
dcast(melt(setDT(d1), id.var=c('OTU.ID', 'cool'))[,
sum(value) ,.(cool, variable)], variable~c('notcool.foods',
'cool.foods')[cool+1L], value.var='V1')[,
all.foods:= cool.foods+notcool.foods][, notcool.foods:=NULL]
# variable cool.foods all.foods
#1: s1 39 109
#2: s2 130 135
#3: s3 121 143
Or instead of using dcast we can summarise the result (as in #jeremycg's post) as there are only two groups
melt(setDT(d1), id.var=c('OTU.ID', 'cool'), variable.name='id')[,
list(all.foods=sum(value), cool.foods=sum(value[cool==1])) , id]
# id all.foods cool.foods
#1: s1 109 39
#2: s2 135 130
#3: s3 143 121
Or you can use base R
nm1 <- paste0('s', 1:3)
res <- t(addmargins(rowsum(as.matrix(d1[nm1]), group=d1$cool),1)[-1,])
colnames(res) <- c('cool.foods', 'all.foods')
res
# cool.foods all.foods
#s1 39 109
#s2 130 135
#s3 121 143
Here's how I would do it, with reshape2 and dplyr:
library(reshape2)
library(dplyr)
d1 <- melt(d1, id = c("OTU.ID", "cool"))
d1 %>% group_by(variable) %>%
summarise(all.foods = sum(value), cool.foods = sum(value[cool == 1]))
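For completeness, a minimal base R sketch of the same vectorisation (assuming the same d1 as above; this is not one of the original answers), where colSums replaces the explicit loop:
samps <- setdiff(colnames(d1), c("OTU.ID", "cool"))
results <- data.frame(
  id = samps,
  cool.foods = colSums(as.matrix(d1[samps]) * d1$cool),
  all.foods = colSums(d1[samps])
)
results
#    id cool.foods all.foods
# s1 s1         39       109
# s2 s2        130       135
# s3 s3        121       143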
