Bind tables of different lengths - R

Thank you, good people! This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I extract all kinds of information via table(). I then want to store that information, which is essentially a set of counts, and so I also want to store the names that were counted. For a reproducible example, consider:
```
a <- c("a","b","c","d","a","b") # one count, occurring twice for a and b and once for c and d
b <- c("a","c")                 # a completely different property from the dataset, occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
```
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution iteratively, in a loop that first combines x and y into the requested form above and then keeps adding tables, each one hopefully adding a column. One of my many failed attempts, just to show the logic, is:
member <- function(data = dfm, groupvar = 'group', analysis = kc15) {
  res <- matrix(NA, ncol = length(analysis$size) + 1)
  res[, 1] <- table(docvars(data, groupvar))
  for (i in 1:length(analysis$size)) {
    r <- table(docvars(data, groupvar)[analysis$cluster == i])
    res <- cbind(res, r)
  }
  res
}
So, to sum up: the reproducible example above stands in for the first column of res plus one r, and what I'm seeking (I think) is a correct replacement for the cbind, one that can add columns of different lengths but with shared names, as in the example above.
Please help; it's embarrassing how much time I'm wasting on this.

In base R, you can use table and stack, then full-join the two sets of counts with merge:
out <- merge(stack(table(a)), stack(table(b)), by = 'ind', all = TRUE)
out
# ind values.x values.y
#1 a 2 1
#2 b 2 NA
#3 c 1 1
#4 d 1 NA
If you want to replace NA with 0, you can do:
out[is.na(out)] <- 0
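Since you need this iteratively, you can fold the same merge over any number of tables with Reduce. A minimal sketch, assuming the tables are collected in a named list tabs (a name made up here):

```r
tabs <- list(x = table(a), y = table(b))   # add as many tables as needed
stacked <- lapply(tabs, stack)             # each becomes a data.frame with columns values, ind
out <- Reduce(function(l, r) merge(l, r, by = 'ind', all = TRUE), stacked)
names(out)[-1] <- names(tabs)              # one count column per table
out[is.na(out)] <- 0
```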

One purrr and tidyr solution could be:
library(purrr)
library(tidyr)

map_dfr(lst, ~ stack(table(.)), .id = "ID") %>%
  pivot_wider(names_from = "ID", values_from = "values", values_fill = list(values = 0))
ind a b
<chr> <int> <int>
1 a 2 1
2 b 2 0
3 c 1 1
4 d 1 0
lst being a named list of the input vectors (it can hold any number of them, each adding a column):
lst <- list(a = a,
            b = b)

Related

Vectorized Calculation in R

I was doing some calculation in R and was confused by the logic R uses.
For example,
table <- data.frame(a = c(1,NA,2,1), b= c(1,1,3,2))
Here, I am going to create the third column "c"
Column c should be 0 wherever column a is NA; otherwise it should be the sum of columns a and b.
So the column c should be
c(2,0,5,3)
I wrote:
table$c <- 0
table$c[!is.na(table$a)] <- table$a + table$b
And I have column c as
c(2,0,NA,5)
I see that
table$c[3] = table$a[2] + table$b[2]
when I wanted it to be table$c[3] = table$a[3] + table$b[3].
I thought R would skip index number 2 on both the left and the right side and jump to index 3, but in fact R skipped index 2 only on the left, not on the right...
Why does this happen? How should I prevent this?
Thank you.
Use
table$c <- apply(table, 1, sum)
table$c[is.na(table$c)] <- 0
Or, even simpler if you are just starting to learn R:
table$c <- table$a + table$b
table$c[is.na(table$c)] <- 0
To prevent mistakes like this, don't ask R to do two things at the same time, as here:
table$c[!is.na(table$a)] <- table$a + table$b
The left-hand side selects only three elements, but the right-hand side produces four values; R fills the selected slots with the right-hand side values in order, so the indices no longer line up.
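If you do want to keep the subsetted assignment, a minimal fix on the question's data is to apply the same subset to both sides so the lengths match:

```r
keep <- !is.na(table$a)
table$c <- 0
table$c[keep] <- table$a[keep] + table$b[keep]
# table$c is now c(2, 0, 5, 3)
```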
Alternatively, you could make use of the data.table package
library(data.table)
table <- data.table(a = c(1,NA,2,1), b = c(1,1,3,2)) # create the data.table
table[, c := ifelse(is.na(a), 0, a + b)]             # create column c based on the condition
> table
a b c
1: 1 1 2
2: NA 1 0
3: 2 3 5
4: 1 2 3

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (it's a lot of trial and error, but that's OK).
Now I need to write a more sophisticated piece of code and I can't find a way to do it.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140,000 entries) where the 'ID' variable is the same.
This is the third day I've been trying to solve this; I already found and tested for and apply, but both run for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # the Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be a silly piece of code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need two things:
- sum up some values in one dataset
- add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking this down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
               stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is character here, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, join only the columns you need). I found dplyr's join functions very handy for these cases. Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
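Applied to the question's data, a minimal sketch of this approach could look as follows. It assumes the columns Source, ID, VAL1, VAL2, and VAL9 from the example, exact matches between Sub02$ID and Sub03$Source (the original loop used grepl, i.e. substring matching), and the helper names sums and extra are made up:

```r
library(dplyr)

# per Source: number of matching rows plus the sums of VAL1 and VAL2
sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(extra = n() + sum(as.numeric(VAL1)) + sum(as.numeric(VAL2)))

# join on the shared identifier and update VAL9 once, without a loop
Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) + coalesce(extra, 0)) %>%
  select(-extra)
```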

R $ operator is invalid for atomic vectors

I have a dataset where one of the columns contains only the "#" sign. I used the following code to remove this column:
ia <- as.data.frame(sapply(ia, gsub, pattern = "#", replacement = ""))
However, after this operation one of the integer columns I had was changed to a factor.
I wonder what happened and how I can avoid that. Appreciate it.
A more correct version of your code might look like this. (Your original failed because sapply returns a character matrix, and wrapping it in as.data.frame turns every column into a factor under the old stringsAsFactors default; assigning into d[] keeps the original data frame structure instead.)
d <- data.frame(x = as.character(1:5), y = c("a","b","#","c","d"))
d[] <- lapply(d, gsub, pattern = "#", replacement = "")
d
x y
1 1 a
2 2 b
3 3
4 4 c
5 5 d
But as you'll note, this approach will never actually remove the offending column. It's just replacing the # values with empty character strings. To remove a column of all # you might do something like this:
d <- data.frame(x = as.character(1:5),
                y = c("a","b","#","c","d"),
                z = rep("#", 5))
d[, !sapply(d, function(x) all(x == "#"))]
x y
1 1 a
2 2 b
3 3 #
4 4 c
5 5 d
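If the real concern is the integer column changing type, another option (a sketch, assuming the affected columns are stored as character rather than factor) is to apply gsub only to the character columns, so the other columns keep their types:

```r
is_chr <- sapply(ia, is.character)                                      # which columns are character
ia[is_chr] <- lapply(ia[is_chr], gsub, pattern = "#", replacement = "") # clean only those
```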
Surely if you want to remove an offending column from a data frame, and you know which column it is, you can just subset. So, if it's the first column:
df <- df[,-1]
If it's a later column, adjust the index accordingly.
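Removing by name instead of by position is a bit safer, since it keeps working if the column order changes; for instance, with a hypothetical column name:

```r
df$offending_column <- NULL  # "offending_column" is a placeholder
```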

Remove last N rows in data frame with the arbitrary number of rows

I have a data frame and I want to remove the last N rows from it.
If I want to remove 5 rows, I currently use the following command, which in my opinion is rather convoluted:
df <- df[-seq(nrow(df), nrow(df)-4), ]
How would you accomplish this task? Is there a convenient function that I can use in R?
In unix, I would use:
tac file | sed '1,5d' | tac
head with a negative index is convenient for this...
df <- data.frame( a = 1:10 )
head(df,-5)
# a
#1 1
#2 2
#3 3
#4 4
#5 5
p.s. your seq() example may be written slightly less(?) awkwardly using the named arguments by and length.out (shortened to len) like this -seq(nrow(df),by=-1,len=5).
This one takes one more line, but is far more readable:
n <- dim(df)[1]
df <- df[1:(n-5), ]
Of course, you can do it in one line by sticking the dim command directly into the re-assignment statement.
I assume this is part of a reproducible script, and you can retrace your steps... Otherwise, strongly recommend in such cases to save to a different variable (e.g., df2) and then remove the redundant copy only after you're sure you got what you wanted.
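For instance, the one-line version, saved to df2 as recommended above, would look something like this:

```r
df2 <- df[1:(dim(df)[1] - 5), ]
```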
Adding a dplyr answer for completeness:
library(dplyr)
test_df <- data_frame(a = c(1,2,3,4,5,6,7,8,9,10),
                      b = c("a","b","c","d","e","f","g","h","i","j"))
slice(test_df, 1:(n() - 5))
## A tibble: 5 x 2
# a b
# <dbl> <chr>
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
Another dplyr answer which is even more readable:
df %>% filter(row_number() <= n()-5)
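For comparison, a base R sketch of the same idea; seq_len also behaves gracefully when the data frame has exactly 5 rows (it returns an empty index rather than counting backwards):

```r
df[seq_len(nrow(df) - 5), , drop = FALSE]
```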

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently count the unique observations by column in a data.table would be much appreciated. Even better would be a recommendation on how to drop the columns that contain only one unique observation.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
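Dropping the single-valued columns with uniqueN is similarly compact; a sketch (assuming data.table >= 1.9.6 is loaded, and n_uniq is a name made up here):

```r
n_uniq <- dt[, sapply(.SD, uniqueN)]        # named vector: distinct count per column
dt[, (names(n_uniq)[n_uniq == 1]) := NULL]  # delete the constant columns by reference
```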
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work, you need to use with=FALSE within [.data.table and count the rows of the unique result, or simply use [[ instead (read fortune(312) as well...):
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or, to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with = FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
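As for simplifying it: since bolCols is already a logical vector over the columns, one option (a sketch) is to select the non-blank columns directly. Note that apply() coerces the data.table to a character matrix first, which is harmless here because we only compare against strings.

```r
dt[, which(!bolCols), with = FALSE]
```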
There is an easy way to do this using the dplyr library and its select function, as follows (first_variable and second_variable are placeholders for your column names):
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like; you will then get just the data you want.
