Create and process several columns with a loop in R

I'm quite new to R and I would like to learn how to write a loop to create and process several columns.
I imported a table into R that contains data with 23 variables. For each of these variables I want to calculate the per capita value, multiply it by 1000, and write the result either into a new table or into the same table as the old data.
To do this for only one column, my operation looked like this:
agriculture <- cbind(agriculture, "Total_value_per_capita" = agriculture$Total/agriculture$Total.Population*1000)
Now I'm asking how to do this in a loop for all 23 variables so that I won't have to write 23 similar lines of code.
I think the solution might look quite similar to the code pasted in this thread:
loop to create several matrix in R (maybe using paste)
but I didn't get it working on my code.
So any suggestion would be very helpful.

I would always favor an appropriate *ply function over loops in R. In this case sapply could be your friend:
df <- data.frame(a = sample(10), b = sample(10), c = sample(10))
df.per.capita <- as.data.frame(
  sapply(df[colnames(df) != "c"], function(x) x / df$c * 1000)
)
For more complicated cases, you should definitely have a look at the plyr package.
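For instance, plyr's colwise() lifts a function of one vector into a function that maps over every column of a data frame. A minimal sketch, reusing the toy df from above (this usage is illustrative, not part of the original answer):
library(plyr)
# colwise() turns a vector function into a column-by-column data frame function
per.capita.fn <- colwise(function(x) x / df$c * 1000)
df.per.capita2 <- per.capita.fn(df[colnames(df) != "c"])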

This can be done using the sweep function. Using Beasterfield's data generation, but setting the seed, you can obtain the same results:
set.seed(001)
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
# MARGIN = 1 with STATS = df$c divides each row by the corresponding element of df$c
per.capita <- sweep(df[, colnames(df) != "c"], MARGIN = 1, STATS = df$c, FUN = '/') * 1000
per.capita
a b
1 300.0000 300.0000
2 2000.0000 1000.0000
3 833.3333 1000.0000
4 7000.0000 10000.0000
5 222.2222 555.5556
6 1000.0000 875.0000
7 1285.7143 1142.8571
8 1200.0000 800.0000
9 3333.3333 333.3333
10 250.0000 2250.0000
Comparing with Beasterfield's results:
all.equal(df.per.capita, per.capita)
[1] TRUE

Related

How do I use dataframe names as inputs for var in for loop (R language)?

In R, I defined the following function:
# assumes dplyr is loaded for %>%, group_by() and tally()
race_ethn_tab <- function(x) {
  x %>%
    group_by(RAC1P) %>%
    tally(wt = PWGTP) %>%
    print(n = 15)
}
The function simply generates a weighted tally for a given dataset; for example, race_ethn_tab(ca_pop_2000) generates a simple 9 x 2 table:
1 Race 1 22322824
2 Race 2 2144044
3 Race 3 228817
4 Race 4 1827
5 Race 5 98823
6 Race 6 3722624
7 Race 7 116176
8 Race 8 3183821
9 Race 9 1268095
I have to do this for several (approx. 10) distinct datasets, and it's easier for me to keep the dfs distinct rather than bind them and create a year variable. So, I am trying to use either a for loop or purrr::map() to iterate through my list of dfs.
Here is what I tried:
dfs_test <- as.list(as_tibble(ca_pop_2000),
                    as_tibble(ca_pop_2001),
                    as_tibble(ca_pop_2002),
                    as_tibble(ca_pop_2003),
                    as_tibble(ca_pop_2004))
# Attempt 1: Using for loop
for (i in dfs_test) {
race_ethn_tab(i)
}
# Attempt 2: Using purrr::map
race_ethn_outs <- map(dfs_test, race_ethn_tab)
Both attempts are telling me that group_by can't be applied to a factor object, but I can't figure out why the elements in dfs_test are being registered as factors given that I am forcing them into the tibble class. Would appreciate any tips based on my approach or alternative approaches that could make sense here.
This, from @RonakShah, was exactly what was needed:
Your code should work if you use list instead of as.list. See the output of
as.list(as_tibble(mtcars), as_tibble(iris)) vs list(as_tibble(mtcars),
as_tibble(iris)) – Ronak Shah Oct 2 at 0:23
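To see the difference concretely: as.list() coerces its first argument (a tibble) into a list of its columns and silently ignores the remaining arguments, whereas list() keeps each tibble as one element. A quick illustration:
library(tibble)
length(as.list(as_tibble(mtcars), as_tibble(iris)))  # 11 -- the columns of mtcars
length(list(as_tibble(mtcars), as_tibble(iris)))     # 2  -- the two tibbles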
We can use mget to return a list of datasets, then loop over the list and apply the function
dfs_test <- mget(paste0("ca_pop_", 2000:2004))
It can also be made more general if we use ls:
dfs_test <- mget(ls(pattern = '^ca_pop_\\d{4}$'))
map(dfs_test, race_ethn_tab)
This makes it easier when there are hundreds of objects already created in the global environment, instead of writing out
list(ca_pop_2000, ca_pop_2001, .., ca_pop_2020)

For loop on dates in R

R-users,
I have this dataframe:
head(M2006)
X.ID_punto MM.GG.AA Rad_SWD
2945377 1 0001-01-06 19.918
2945378 2 0001-01-06 19.911
2945379 1 0001-02-06 19.903
2945380 2 0001-02-06 19.893
2945381 1 0001-03-06 19.875
2945382 2 0001-03-06 19.858
What I need to do is to obtain a different subset for every date (MM.GG.AA):
subset(M2006, M2006$MM.GG.AA=="0001-10-06")
or, in other words, a different subset for every site (X.ID_punto):
subset(M2006, M2006$X.ID_punto==1)
Is it possible to loop this over sites (X.ID_punto) or dates (MM.GG.AA)?
I have tried in this way:
output <- data.frame(ID=rep(1:365))
for (p in as.factor(M2006[,1])) {
  sub <- subset(M2006, M2006$X.ID_punto==p)
  output[,p] <- sub$Rad_SWD
}
The code runs, but without looping over every ID.
If I can't loop, I'll have to write subset(M2006, M2006$X.ID_punto==xxx) a thousand times...
Thank you in advance!
Fra
I think from your description of input and desired output you can achieve this pretty simply using the reshape package and the cast function:
require(reshape)
cast(M2006, MM.GG.AA ~ X.ID_punto, value = .(Rad_SWD))
# MM.GG.AA 1 2
#1 0001-01-06 19.918 19.911
#2 0001-02-06 19.903 19.893
#3 0001-03-06 19.875 19.858
It will certainly be quicker than using loops (it isn't going to be the absolute quickest solution, but I imagine < 1-2 seconds).
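If the older reshape package is unavailable, its successor reshape2 offers an equivalent reshape; a sketch assuming the same column names (not part of the original answer):
library(reshape2)
# dcast() spreads the sites (X.ID_punto) into columns, with one row per date
wide <- dcast(M2006, MM.GG.AA ~ X.ID_punto, value.var = "Rad_SWD")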
I've found a possible solution by myself.
I won't delete my question; maybe someone will find it useful.
# first of all, since I have 1008 sites (X.ID_punto)
# I created a list of my sites
list <- rep(1:1008)
# then, create a dataframe where I'll store my subsets.
# Every subset will be a column of 365 observations
output <- data.frame(site1=rep(1:365))
# loop the subset function over the list of 1008 sites
for (p in 1:length(list)) {
  print(p)  # just to see that the loop is running
  sub <- subset(M2006, M2006$X.ID_punto==p)
  output[,p] <- sub$Rad_SWD  # add the subset, as a column, to the output dataframe
}
write.csv(output, "output.csv")  # save the resulting data frame
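For reference, a loop-free sketch of the same reshaping with base R's split(), under the same assumption the loop makes (every site has exactly 365 observations):
# split() returns one vector of Rad_SWD values per site, named by X.ID_punto
by_site <- split(M2006$Rad_SWD, M2006$X.ID_punto)
output2 <- as.data.frame(by_site)  # one column per site, 365 rows each
write.csv(output2, "output2.csv")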

Using Histogram as input in R

This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: one of categorical data names, and the second a count column (the count for each of the categories). With a small dataset, I would use 'reshape' and the function 'untable' to make one column and do the analysis that way. The question is, how do I handle this with a large data set?
In this case, my data is humongous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
In other words, R should understand that the histogram input above is the same as the following expanded data:
A
A
A
A
A
B
B
B
B
B
B
B
C
In reasonable size data, I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, trying to understand the correlation between this new data and other pieces of data. Need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
  paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
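To make that concrete, here is a minimal sketch with the toy histogram from above; the scores vector is hypothetical, purely to give wtd.mean something numeric to average:
library(Hmisc)  # for wtd.mean and related weighted statistics

dat <- data.frame(Cat = LETTERS[1:3], Count = c(5, 7, 1))

# xtabs() accepts the counts directly on the left-hand side of the formula
xtabs(Count ~ Cat, data = dat)

# weighted mean of hypothetical per-category scores, using counts as weights
scores <- c(A = 1, B = 2, C = 3)
wtd.mean(scores[as.character(dat$Cat)], weights = dat$Count)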
The existing answers are all expanding the pre-binned dataset into a full distribution and then using R's histogram function which is memory inefficient and will not scale for very large datasets like the original poster asked about. The HistogramTools CRAN package includes a
PreBinnedHistogram function which takes arguments for breaks and counts to create a Histogram object in R without massively expanding the dataset.
For Example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a list of 13 elements first and then create the histogram. PreBinnedHistogram in contrast creates the histogram directly from the 3 element input list without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
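As a concrete sketch of the call above (the numeric bin edges are hypothetical stand-ins, since histograms are defined over numeric breaks rather than category labels, and breaks must have one more element than counts):
library(HistogramTools)
h <- PreBinnedHistogram(0:3, c(5, 7, 1))  # three bins holding 5, 7 and 1 elements
plot(h)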

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find elements of the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
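To unpack why this works: as.list() turns the data frame into a list of column vectors, duplicated() flags any column identical to an earlier one, and the negated logical vector keeps the rest. With the example testframe above:
dup <- duplicated(as.list(testframe))
dup
# [1] FALSE FALSE  TRUE FALSE  TRUE   (height2 and gender2 are flagged)
testframe[!dup]  # keeps age, height and gender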
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
Not 100% reliable, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
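One caveat: t() first coerces the data frame to a character matrix, so columns are compared by their printed representation. With that in mind, a usage sketch:
# duplicated() on the transpose flags duplicate columns; drop them by negation
testframe[, !duplicated(t(testframe))]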
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example, summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually you would just need to invert the duplicated result in your code and could stick to using subset (which is more readable compared to bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many, using for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
# anchor the pattern so only a literal '.x' at the end of a name is stripped
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result, but the answer is a bit slow for a large matrix, I propose a version using exponential search and the power of data.table. It is a function I implemented in the dataPreparation package.
The function dataPreparation::which_are_in_double identifies the duplicated columns (the package also provides which_are_bijection, mentioned below):
which_are_in_double(testframe)
This returns 3 and 4, the columns that are duplicated in this example.
Build a data set with the wanted dimensions for performance tests:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12) {
  testframe <- rbind(testframe, testframe)
}
# Results in 49,152 rows
for (i in 1:5) {
  testframe <- cbind(testframe, testframe)
}
# Results in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
benchmark(
  which_are_in_double(testframe, verbose=FALSE),
  duplicated(lapply(testframe, summary)),
  duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] from the benchmark because it was already 10 times slower with 12k rows.
NB 2: Please note that, given the way this data set is constructed, it has a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_bijection would perform much better, and the other methods would perform similarly.

Apply multiple functions to column using tapply

Could someone please point out how we can apply multiple functions to the same column using tapply (or any other method: plyr, etc.) so that the result can be obtained in distinct columns? For example, if I have a dataframe with:
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get the result as the sum of MoneySpent plus the number of occurrences.
I used a function like:
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split the result into columns; it gives something like:
Joe 100, 5  # the sum = 100, the number of occurrences = 5, but they get juxtaposed
Thanks in advance,
Raj
You can certainly do stuff like this using ddply from the plyr package:
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat, .(x), summarise, total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)
ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
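For comparison, base R's aggregate() can return both summaries at once when the supplied function yields a named vector; a sketch assuming the df from the question, with columns User and MoneySpent:
# the result stores both statistics in a matrix column named MoneySpent
aggregate(MoneySpent ~ User, data = df,
          FUN = function(x) c(total = sum(x), count = length(x)))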
